Wednesday, January 21, 2009

Regular expression tutorial: new example

The Java regular expression tutorial section has been updated with a new example of using regular expressions. In this example, we look at how to perform what is sometimes referred to as HTML scraping: pulling out data from an HTML page (or indeed XML document).

The example is located here: HTML scraping with Java regular expressions. As explained in this tutorial, regular expressions are a good candidate for data scraping because they are flexible. Various libraries exist to parse an HTML or XML document and return an object representation of that document. But such libraries are often "too fussy": many if not most web pages actually do not conform strictly to HTML standards. Similarly, many XML parsing libraries are too fussy for real-life RSS feeds, which are often malformed, strictly speaking. Using regular expressions cuts through the fuss.
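To give a flavour of the approach, here is a minimal sketch (not the code from the tutorial itself) that pulls the href values out of anchor tags. The class name LinkScraper, the method extractLinks and the pattern are illustrative assumptions. The pattern ignores case and allows extra attributes, which is the kind of leniency a strict parser lacks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from the tutorial.
final class LinkScraper {
    // Matches <a ... href="..."> ... </a>, with single or double quotes,
    // any letter case, and arbitrary other attributes around href.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href value of every anchor found, in document order.
    static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the quoted URL
        }
        return urls;
    }
}
```

Note that this sketch only handles quoted attribute values; real pages may need a looser pattern still.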

When should you scrape web pages?

Note that the article focusses on the technical question of how to scrape HTML pages in Java. It doesn't deal with the "political" issue of whether the site in question wants its content scraped in the first place. In general, it is good practice to do the following:

  • find out if the web site in question has an API to provide the data in a more convenient format (and in a format that it would prefer to provide it to you in!)
  • be open about what you're doing: if the web site administrator thinks you're trying to "hide" something, they may block your IP address
  • respect server resources: if you are retrieving multiple pages, consider putting a thread sleep between fetches
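
As a rough sketch of the last point, the loop below pauses between fetches. The class name PoliteFetcher, the method fetchAll and the delay value are assumptions made for illustration; the actual page retrieval is passed in as a function so that the pausing logic stays separate from however you choose to download a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration; not taken from the tutorial.
final class PoliteFetcher {
    // Fetches each URL in turn, sleeping delayMillis between consecutive
    // fetches (but not before the first one). If the thread is interrupted
    // while sleeping, the pages fetched so far are returned.
    static List<String> fetchAll(List<String> urls,
                                 Function<String, String> fetcher,
                                 long delayMillis) {
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
            }
            pages.add(fetcher.apply(urls.get(i)));
        }
        return pages;
    }
}
```

A suitable delay depends on the site; a second or more between requests is a cautious starting point.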
