Thursday, January 22, 2009

Java Certification Tips

A new page of Java Certification Tips gives a "cribsheet" of some of the "syntactic niggles" and other commonly overlooked features of Java that could trip up an otherwise moderately experienced programmer when taking the Sun Certified Java Programmer (SCJP) exam. Did you know, for example, that:

  • this is invalid Java syntax?
    float f = 2.5;
  • goto is a Java keyword?
  • this line does not assign the value 13 to the variable?:
    int i = 013;
See the page of tips for more information on these and other niggles.
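
As a quick illustration of the three niggles above, here is a minimal sketch. The class and method names (CertificationNiggles, floatLiteral, octalLiteral) are made up for this post and are not taken from the tips page. It shows the compiling version of the float assignment and what the octal literal actually evaluates to.

```java
public class CertificationNiggles {

    // A floating-point literal with no suffix is a double, so
    // "float f = 2.5;" is rejected by the compiler as a lossy conversion.
    // Adding the f suffix (or an explicit cast) makes it legal.
    static float floatLiteral() {
        float f = 2.5f;
        return f;
    }

    // A leading zero marks an octal literal: 013 is 1*8 + 3, i.e. 11.
    static int octalLiteral() {
        int i = 013;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(floatLiteral()); // prints 2.5
        System.out.println(octalLiteral()); // prints 11

        // goto is a reserved keyword even though the language never uses it,
        // so a declaration such as "int goto = 1;" does not compile.
    }
}
```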

Wednesday, January 21, 2009

Regular expression tutorial: new example

The Java regular expression tutorial section has been updated with a new example of using regular expressions. In this example, we look at how to perform what is sometimes referred to as HTML scraping: pulling out data from an HTML page (or indeed an XML document).

The example is located here: HTML scraping with Java regular expressions. As explained in this tutorial, regular expressions are a good candidate for data scraping because they are flexible. Various libraries exist to parse an HTML or XML document and return an object representation of that document. But such libraries are often "too fussy": many if not most web pages actually do not conform strictly to HTML standards. Similarly, many XML parsing libraries are too fussy for real-life RSS feeds, which are often malformed, strictly speaking. Using regular expressions cuts through the fuss.
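
To give a flavour of the approach, here is a minimal sketch that pulls link targets out of a fragment of HTML. It is not the code from the tutorial: the class name LinkScraper, the method extractLinks, the pattern and the sample markup are all assumptions chosen for illustration. The sample is deliberately sloppy (unclosed paragraph tags, mixed-case tag names), which is the kind of input a strict parser might reject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScraper {

    // Illustrative pattern only: matches an <a> tag with a double-quoted
    // href attribute. Group 1 is the href value, group 2 the link text.
    private static final Pattern LINK = Pattern.compile(
        "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href value of every link found, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Not well-formed HTML: the <p> tags are never closed.
        String html = "<p>See <A HREF=\"http://example.com/a\">first</A>"
                    + "<p>and <a class=\"x\" href=\"/b\">second</a>";
        System.out.println(extractLinks(html));
        // prints [http://example.com/a, /b]
    }
}
```

A pattern like this only handles the cases it was written for (here, double-quoted href values), so in practice you would adapt it to the markup of the particular page you are scraping.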

When should you scrape web pages?

Note that the article focuses on the technical side of scraping HTML pages in Java. It doesn't deal with the "political" issue of whether the site in question wants its content scraped in the first place. In general, it is good practice to do the following:

  • find out if the web site in question has an API to provide the data in a more convenient format (and in a format that it would prefer to provide it to you in!)
  • be open about what you're doing: if the web site administrator thinks you're trying to "hide" something, they may block your IP address
  • respect server resources: if you are retrieving multiple pages, consider putting a thread sleep between fetches
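
The last point can be sketched roughly as follows. The class PoliteFetcher, the method fetchAll and the delay value are hypothetical, and the fetcher is passed in as a function so that the example runs without touching the network. In real code it would wrap whatever you use to download a page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class PoliteFetcher {

    // Applies the fetcher to each URL in turn, sleeping for delayMillis
    // between consecutive fetches (but not before the first one).
    static <T> List<T> fetchAll(List<String> urls,
                                Function<String, T> fetcher,
                                long delayMillis) {
        List<T> results = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up on the batch.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted between fetches", e);
                }
            }
            results.add(fetcher.apply(urls.get(i)));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.com/page1",
            "http://example.com/page2");

        // Stand-in fetcher (an assumption for this sketch): returns a
        // placeholder string instead of downloading anything.
        List<String> pages = fetchAll(urls, url -> "contents of " + url, 1000L);
        System.out.println(pages);
    }
}
```

How long to sleep is a judgment call. The one-second delay used here is only a placeholder, not a recommendation from any particular site.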