The example is located here: HTML scraping with Java regular expressions. As explained in this tutorial, regular expressions are a good candidate for data scraping because they are flexible. Various libraries exist to parse an HTML or XML document and return an object representation of that document. But such libraries are often "too fussy": many, if not most, web pages do not conform strictly to HTML standards. Similarly, many XML parsing libraries are too fussy for real-life RSS feeds, which are often malformed, strictly speaking. A regular expression matches only the fragment you care about, so it cuts through the fuss and still works when the rest of the document is not well formed.
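To make the idea concrete, here is a minimal sketch of pulling link targets out of deliberately sloppy HTML. The class name LinkScraper, the method extractLinks and the pattern itself are my own illustration rather than code from the tutorial, and the pattern only handles quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScraper {
    // Hypothetical pattern for illustration: finds an <a ...> tag and captures
    // the value of a single- or double-quoted href attribute, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns every quoted href value found in the given HTML, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 1 is the quote character, group 2 the URL
        }
        return links;
    }

    public static void main(String[] args) {
        // Unclosed <p>, upper-case tag, unquoted class attribute, missing </a>:
        // the kind of markup a strict parser might reject.
        String html = "<p>Unclosed paragraph <A HREF='/a.html'>A</a><br>"
                + "<a class=x href=\"http://example.com/b\">B";
        System.out.println(extractLinks(html)); // prints [/a.html, http://example.com/b]
    }
}
```

The pattern never looks at whether tags are closed or nested correctly, which is exactly why it survives malformed pages. The flip side is that it will also miss anything it was not written for, such as unquoted href values.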
When should you scrape web pages?
Note that the article focuses on the technical side of scraping HTML pages in Java. It doesn't deal with the "political" issue of whether the site in question wants its content scraped in the first place. In general, it is good practice to do the following:
- find out if the web site in question has an API that provides the data in a more convenient format (and in a format that the site would prefer to give it to you in!)
- be open about what you're doing: if the web site administrator thinks you're trying to "hide" something, they may block your IP address
- respect server resources: if you are retrieving multiple pages, consider putting a thread sleep between fetches
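The last point can be sketched roughly as follows. The names PoliteFetcher and fetchAll, and the idea of passing the fetch step in as a function, are assumptions of mine for illustration; the one-second delay is an arbitrary value, not a recommendation from the article.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PoliteFetcher {
    /**
     * Fetches each URL in turn, sleeping delayMillis between consecutive fetches.
     * The fetcher argument is a hypothetical hook: it takes a URL and returns the
     * page content, so the pause logic stays separate from the HTTP code.
     */
    public static Map<String, String> fetchAll(List<String> urls, long delayMillis,
                                               Function<String, String> fetcher)
            throws InterruptedException {
        Map<String, String> pages = new LinkedHashMap<>();
        boolean first = true;
        for (String url : urls) {
            if (!first) {
                Thread.sleep(delayMillis); // pause before every fetch except the first
            }
            first = false;
            pages.put(url, fetcher.apply(url));
        }
        return pages;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in fetcher that performs no network access; replace it with real
        // HTTP code (for example, one built on java.net.http.HttpClient in Java 11+).
        Map<String, String> pages = fetchAll(
                List.of("page1", "page2"), 1000, url -> "<html>" + url + "</html>");
        System.out.println(pages.keySet()); // prints [page1, page2]
    }
}
```

Keeping the sleep in one place makes it easy to raise the delay if the site seems slow or asks you to back off.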