Wednesday, November 4, 2009

When your antivirus nags you worse than your mother: the power of bloatware

If you're looking to rub salt into that wound we call life, few things can top the mighty blue screen of death followed by the severe ticking off that comes from allowing thy computer to be shut down in an improper fashion.

But if there's one thing that comes close, it's got to be Antivirus software. Few other inventions succeed in reducing a squillion-gigahertz processor to the power of a ZX Spectrum. And on the scale of severe tickings off, little can beat the scolding that I received today from the antivirus I had trustily installed not four score and ten days ago.

- "For the last 90 days thou hast willingly deterred from scanning thy device for evil malware," quoth ye antivirus.
- "Evil what?", quoth I.
- "You know-- bad stuff like cookies and worms and shit.".
- "But isn't that your job?"

Alas, my assumptions regarding the autonomy of the antivirus software had proven false. Despite years of research into artificial intelligence and the inordinate number of clock cycles that said software consumed, it would transpire that scanning for evil malware was a task upon which it preferred not to take the initiative.

Thanks be to pointless bloatware.

Tuesday, October 20, 2009

New Section: JNI (Java Native Interface)

The new section on the Java Native Interface will be expanded over the next few weeks. For those not familiar, JNI is the standard framework for calling native code from Java. Thus, JNI effectively allows you to call Windows API functions from Java (or API calls from the operating system in general).

As explained in the section, one caveat of using native code is that the JNI overhead means that writing your code natively may not be as beneficial as anticipated.

Fragmentation

Disk layout:
  • Blue: MyDocuments
  • Red: MS Office (Program Files)
  • Green: Windows\System32

Tuesday, October 6, 2009

When the simple gets complicated: XPath from applets

Those who have read my introduction to XML will be familiar with my mixed feelings on this "technology". In principle, the idea of a standardised, human-readable data format is a handy one. Just in practice, the XML frameworks that are supposed to take the work out of reading this data end up being fundamentally infuriating for some reason.

Of the irritating options available, the XPath API is about the least tedious. Essentially, the idea is that you can call an evaluate() method, passing in a "path" into the XML document by which you refer to elements (data) in the document. Java's standard implementation is stupidly faffy, making you mess around with document builders and configuration exceptions and god knows what other nonesense that you really shouldn't have to care about. But as I say, of the various frustrating options available, it's still about the least frustrating.

At least, until you come to use it in an applet. Picture the scene: my beautiful XML-consuming applet nearing completion and tested locally, I excitedly gmail my colleague to let him know I'll have it up on the web for testing shortly. Then imagine my dismay when the selfsame applet uploaded to the web fails to initialise. Half an hour or so of frustrating debuggery later, it turns out that the culprit is the XPath.evaluate() method, inside whose gubbinery a network fetch is being made for some spurious property on every single attempt to read a piece of data from an XML document. Vive la pluggable architecture.

As I explain on the page, the workaround I found for this was to explicitly set the system property being looked for to its default value via a call to System.setProperty(). For this, alas, the jar must be signed. But I decided that even signing the jar would be less frustrating than re-writing my code to avoid the XPath API. That's how desperate I was by that stage.

With the jar signed, my colleague is now free to accept the warning message about the apocalyptic consequences of running my applet and proceed with testing. Ginger programmer 1, frustrating API nil. For now.

Friday, July 10, 2009

Security and public wireless networks

The other day I wrote of the problem with Social Security Numbers being used as the basis of authentication. I mentioned that the underlying problem was an assumption that security lay somewhere that it didn't.

A recent Fox News article about rogue wireless networks set up by criminals in airports and other public places demonstrates a similar failure to understand where security lies. As pointed out in the article, criminals can easily set up "trojan" networks in places where we would expect a legitimate one to exist.

But from a security point of view, worrying exclusively about these fake networks kind of misses the point. In attempting to make your computer and computer use secure, you should always assume that any network is inherently insecure. It should not matter whether you're connecting via the "official" JFK Airport network or its fake counterpart. The problem isn't that you need to avoid sending confidential details encrypted over the hacker's network, or update your antivirus specifically for connecting to that network. You should always be taking such measures for any network. If you've got your security policy right, then connecting via the hacker's network should be completely safe! You're security should not be relying on a particular network being "safe"; no network carries such a guarantee-- and especially no publicly accessible network (note that even if you had to type in a password to access the airport's network, it's still a public network!).

As well as having a paid-for firewall and antivirus that you keep up to date, you should be taking measures such as always accessing e-mail via an encrypted service, ensuring any financial transactions are made via an encrypted service, heeding warnings from your browser about problems with certifiates, not installing software from untrusted web sites, making sure you e-mail service has built in antivirus, and in any case not opening e-mails from suspicious recipients.

Tuesday, July 7, 2009

Social Security Numbers and "security": a case of misguided assumption

The research on predicting social security numbers published today by Alessandro Acquisti and Ralph Gross from Carnegie Mellon University unfortunately highlights a fairly classic case of something we do all too often: basing security on something that was never secure-- and never really intended to be secure-- in the first place.

Much of the information that we readily pretend is a valid authentication key (such as Mother's Maiden Name, Date of Birth and Post Code-- and indeed Social Security Number) has really always been publicly available information. The parameter that has changed is how financially viable it is for a criminal to access the public records necessary to deduce this "secret" information. The SSN allocation scheme is perfectly well documented, public information, and the scheme clearly has no element of security built into it whatsoever. The historical origins of the scheme are also documented: the scheme has no security now, never did and was never intended to.

So what do we need to do about this? We need to understand where security actually lies, and not pretend that it exists in places where it doesn't. In most cases, the "security" does not currently lie in whether somebody can guess your PIN number, forge your signature, find out your mother's maiden name or guess the last couple of digits of your SSN. Our measures for preventing discovery of these largely unsecret "secrets" are predictably diabolical, and they are thus extremely weak forms of authentication. A transaction that is "authenticated" by an SSN or signature is essentially unauthenticated and the security of that transaction relies on it being quickly reversed in the event of fraud. So long as users, banks and lawmakers all understand this, the situation isn't so dire. The big danger comes when we pretend that there is security and authentication where there really isn't.

Friday, July 3, 2009

Java, XML and XPath

The Javamex site now includes a basic introduction to using XML and XPath in Java. Java provides various means to read XML, but XPath is generally the most practical for moderately-sized XML documents. XPath effectively allows you to treat a document as a file system and refer to elements by their "path" within the document.

Sunday, May 31, 2009

Random numbers in Java

A few updates have been made to the site's section on generating random numbers in Java. Random numbers can crop up in all sorts of applications, and it's worth having a good understanding of them.

Traditionally, the bread-and-butter means of random number generation has been the java.util.Ranom class. The technique it uses (see the section on the java.util.Random algorithm for more details) is still suitable for some very casual applications of random number generation, such as some simple games. But in many cases, you should probably be thinking of moving away from java.util.Random towards a higher quality generator. As discussed in the section, problems with java.util.Random include its low period, biases in the different bits of the numbers generated, and its unsuitability for generating combinations of values. Using a weak random generator can have side effects such as the following:
  • it can skew the results of a test harness that introduces subtle biases in the code path due to the random number generator;
  • when testing the performance of data structures, and in various simulations, a generator such as java.util.Random will not produce a good range of possible combinations of values, giving false results;
  • the series of numbers can be predictable, leading to disasterous results if used as the basis for security (e.g. generating a random encryption key, nonce or session ID).
One interesting technique, published in 2003 by Goerge Marsaglia, is the XORShift generator. This generates medium-quality random numbers in a few simple machine instructions. For example, given a Java long, x, we can generate the next random number as simply as follows (in fact, various combinations of shift values are possible, as discussed in Marsaglia's paper):


x ^= (x << 21);
x ^= (x >>> 35);
x ^= (x << 4);


continually executing these three lines will cycle through all (2^64)-1 possible values of x (i.e. all values of a long except zero) in pseudorandom order. Initialising x with the value of System.nanoTime(), or some other "random" seed, gives us a fast, medium-quality generator suitable for, say, generating random game data. It will make an excellent choice in many J2ME games, for example.

We also consider a Java implementation of the combined generator suggested by the Numerical Recipes authors, which generates numbers within a similar order of execution time as java.util.Random, but with a higher period and quality.

For applications where security depends on the quality and unpredictability of the random numbers generated, Java's SecureRandom class provides cryptographic strength random numbers, though is some 20-30 times slower than the other techniques.

Saturday, May 30, 2009

New article: how to choose a Java collection

A new addition to the Java Collections section of the Javamex site looks at a question that, perhaps unsurprisingly, crops up fairly frequently: which collections class should you use for a given task? The various collection classes provide a powerful means of managing objects and data in memory, but they're so powerful that choosing between them can sometimes be a bit daunting.

The approach that I take is to split the question firstly into two sub-questions. Firstly, what is the general type of structure that you need? That is, how is the data basically going to be organised? Usually, there is not too much confusion between a list and a map, especially if you consider whether or not you need to answer the question "for a given X, what is the Y"? But the difference between a set and a list is sometimes not well understood, or at least, not considered. With a little bit of thought, it is usually possible to decide in advance whether the purpose of your collection is to decide "is something there or not"? The trick is for programmers to remember to ask that question in the first place, and not simply plump for a list regardless.

Then, having decided on a list, map, set or queue, the next subquestion is obviously which particular flavour is required. In the case of a list, the cases where you wouldn't plump for an ArrayList are relatively uncommon and it should be clear in your head if you do choose something else that you have a "special case".

But in the case of maps, sets and queues, there are definitely more "horses for courses". But, especially in the first two, there are essentially two factors to consider: concurrency and ordering. As in the article, when the various choices are presented in tabular form according to these criteria, things become a little easier. As mentioned, a key rule of thumb is to choose the class that provides the minimum features that you require. If you don't need ordering, don't pay for it.

Java queues present a slightly complex set of choices, but in at least some cases-- especially a DelayQueue or SynchronousQueue-- it should be really clear that you need that class for a special case scenario. It should also be borne in mind that the executors framework means that in many common producer-consumer scenarios, you don't explicitly have to deal with the underlying job queue.

As usual, further questions and comments to any of the articles on Javamex are welcome on either this blog or the associated Java forum.

Wednesday, May 27, 2009

Patently stupid...

Like anyone, I have my fair share of stop-the-world-I-want-to-get-off moments. But I really did have to check the calendar to make sure it wasn't April 1st today when I read through IBM's patent application for their "solution for providing real-time validation of text input fields using regular expression evaluation during text entry" (20090132950). Wading through all the verbiage, it really does appear that IBM think they have "invented" evaluating a regular expression inside an onChange event handler.

In terms of this specific patent application, I don't think many programmers are too worried that they're suddenly going to be prosecuted for calling RegExp in the wrong place-- needless to say, plenty of prior art is being cited, in case it were needed.

But the case does leave some other more interesting questions. Why does IBM of all companies think it needs to stoop this low? And how can it help us ditinguish the "bleeding obvious" from the "possibly patentable"?

I'm really at a loss to answer the first question. Possibly we're simply looking at a clerical error on the part of an overenthusiastic junior employee. But for a company that is supposedly at the forefront of computing research and invention, claiming patent rights on what amounts to a chain of JavaScript API calls doesn't really help them uphold their reputation.

In answer to the second question, I'm reminded of a comment by Donald Knuth, that he regularly sees patents granted to "solutions" to problems that he sets as undergraduate homework questions. If "how do you validate input as the user types into a web form?" had been posted as a question on one of the various Internet programming forums (and probably it has...), it would almost certainly have been tagged as "smells like homework". Few programmers, I suspect, would have tagged it as "goodness me how novel-- I think we should patent this".

I really have no idea how in the field of software patents you can concretely separate "significant invention" from "doing the bleeding obvious" (though, as in this case, I can sometimes recognise the latter when I see it!). But in order for a software solution to be patentable, I would at least expect to see:

- evidence of significant empirical research necessary to find the solution
- an invention of an actual algorithm, with perhaps some guage of number of API calls per other instructions/lines of code in order to determine if a new algorithm had actually been invented.

I would also like to see severe financial penalities for cases such as this, where a company is clearly attempting to use its abundant resources to abuse the system. The small developers that the patent system was originally designed to protect really have to think twice before committing to the thousands of dollars per year that it costs to keep a patent going. A company such as IBM really has nothing to loose by continually paying for nonesensical patent applications out of petty cash on the off-chance that at some point, one of them will sucessfully fly over the cuckoo's nest.

In the case of this particular patent application, I really really hope for the sake of human sanity that common sense fianlly prevails...

Saturday, May 23, 2009

Java on your Kindle

It's excellent to see an array of Java programming books now available on the Kindle. Various of these books, which you may not have considered buying due to their cost, are available at a sometimes significantly reduced price on the Kindle.

Of notable interest to Javamex readers will be Brian Goetz's venerable Java Concurrency in Practice, which is really something of a bible for information on the Java Memory Model and the Java 5 concurrency library. As I say in the review, I can pretty much guarantee that if you're writing concurrent code (as many of us either are or will soon have to, not just on the server, but increasingly client-side), then Java Concurrency in Practice will allow you to fix various bugs in your code that you were possibly unaware of (it certainly fixed some in mine!), as well as help you make decisions about how to architecture your program around the Java 5 concurrency utilities.

For a more light-hearted read, but still an extremely enlightening one, Java Puzzlers remains excellent as ever. Josh Bloch's Effective Java got a whole new lease of life when its second edition was published, and its addition to the Kindle repertoire is most welcome.

Also worthy of note are the Core Java books. These, and in particular the second volume of Advanced Features, is not the most succinct of works, but covers various Java topics that you don't readily see explained in detail in other books.

Java for beginners turorial: what would you like to see?

Readers of the Javamex web site may or may not have seen the first pages of the site's Java for beginners tutorial, which covers a number of very basic Java topics such as variables, control structures (for loops, if/else), plus information on Java arrays.

Various topics are going to be added to the tutorial over the coming weeks. But how do you think it should be expanded? Are you just starting out in Java and having trouble with a particular topic? Or are you an experienced Java programmer, but remember having particular trouble with some aspect of learning Java with the tutorials that you used? Any feedback is welcome on this blog entry.

Tuesday, May 12, 2009

The secret life of the Java 'final' keyword

I'll freely admit that an omission that the Javamex site should have addressed sooner is a more explicit discussion of the Java final keyword. Its use for guaranteeing thread-safety is hinted at in some of the articles, but some more explicit information has been long overdue.

As a step towards addressing this gap, the aforementioned article looks at how, as of Java 5, the final keyword has acquired a very important characteristic for thread-safe programming. Most programmers think of final in terms of its impact on program design or presumed optimisation (which, as I'll mention in a moment, is probably a red herring). But its thread-safety guarantee is far more significant: any field declared as final can be safely accessed by another thread once the constructor completes, whereas, subtly, without this or any other thread-safety mechanism, that guarantee does not hold.

The final keyword and optimisation

An issue with the Java final keyword, also criticised by Josh Bloch and Neal Gafter in their excellent Java Puzzlers book, is that it means different things in different places. When applied to a class or method, it means that that class or method cannot be extended or overridden. This has led to a general (false) conception that "final is to do with optimisation", and its importance in thread-safety has been overlooked.

Even when applied to classes and methods, the notion that final is about optimisation is probably false in most cases. The argument appears to stem from languages such as C++, where "compilation" is a one-off process. In such languages, there are optimisations that you can make to method calls and field accesses if you know that the classes and fields in question will never be overridden. But when you're running in a VM, whether a class or method might "potentially" be overridden doens't really matter. What counts is whether it has been overridden at a given moment. It it hasn't at the point of JIT-compilation, and the JVM makes certain optimisations based on that observation, but then later the class/method is overridden, the JVM generally has the luxury (which a C++ compiler or linker generally doesn't) of being able to re-compile.

So when applied to classes and methods (and in fact, local variables inside methods), final is essentially a program design feature. When applied to instance and class variables, final is an important thread-safety mechanism.

Monday, May 11, 2009

CyclicBarrier

A new section of the Javamex web site looks at the CyclicBarrier class. In case you haven't come across it, CyclicBarrier is a construct that makes coordinating threads easier. Unlike CountDownLatch, which is designed for "one-off" coordinated actions, CyclicBarrier is designed to be re-used, so that it is suitable for iterative processes where the participating threads need to repeatedly perform a parallel operation, but periodically "converge" so that their results can be amalgamated.

In the article, we discuss the example of using a CyclicBarrier to coordinate a parallel sort algorithm. Essentially, the sort takes place in three stages. Each of the stages occurs in parallel, but at the end of each stage, a small single-threaded step must take place to amalgamate the results of the previous parallel operation.

Essentially, the code for a worker thread involved in the operation repeatedly carries out an operation and then calls the barrier's await() method. The latter method blocks until all participating threads have also called await() (we sometimes call this "arriving at the barrier"). At that point, CyclicBarrier executes our pre-determined "amalgamation" routine on one of the threads (the last one to arrive at the barrier, in fact), and then relases all the threads to move on to the next stage. Overall, the class takes out a lot of the actual thread coordination work, although, as we discuss in the article, we must still think about regular concurrency issues such as data synchronization and lock contention.

An additional feature of CyclicBarrier which we discuss is that it handles propagation of interruptions to all participating threads. In other words, if any thread involved in the operation is interrupted, then the whole operation will cease once all threads have called the await() method.

Whilst definitely one of the lesser used of the Java 5 concurrency utilities, CyclicBarrier is definitely a useful class that should not be overlooked.

Wednesday, April 29, 2009

Security issue with Adobe Reader

You may not even have realised it, but for some reason, Adobe Reader can run JavaScript. Why on earth is that?, you might be asking. Isn't the point of a PDF file to store a printable document, not to run programs? Well, you'd have thought so.

But it turns out that, for whatever reason, Adobe Reader can run JavaScript. Not only that, but it can run it really badly. So badly, in fact, that is has a vulnerability whereby "an attacker can exploit this issue to execute arbitrary code the the privileges of the user running the application or crash the application, denying service to legitimate users" (SecurityFocus).

To get round this vulnerability, load Adobe Reader (ideally, don't load it by double-clicking on a PDF file that has been sent to you in an e-mail from an unknown person in South Korea...). Then, go to the Edit menu and to the Preferences option. The option you need is hidden away in the section marked "Javascript". Click on this in the list on the left hand side of the preferences menu, then make sure that the option Enable Acrobat JavaScript is not enabled.

Finally, never re-enable JavaScript in Adobe Reader or any other PDF reader application. There are certain features that are necessary in a printable text document reader application such as, well, text and the ability to print. But you really don't need JavaScript in PDF documents!!!!

On a related note, Microsoft have just announced a vulnerability in Notepad which allows a maliciously formatted txt document to accelerate the mutation of swine flu. Users are advised to paint a white cross on their door before launching any text document more than 4 characters in length.

Sunday, April 19, 2009

New content

A few new pages have been added to the site that you may be interested in:
  • The section on Java cryptography now considers password-based encryption, which we not too surprisingly conclude is fraught with difficulties! At present, we look at the PBE algorithms provided as standard in Sun's Java 6 implementation, although we unfortunately conclude that none of them are terribly great!
  • The section on Swing User interfaces covers various common topics, including a "rogue's gallery" of common Swing components, with a list of common constructors and listeners for those components, plus some hints on adding listeners to your code.
Further pages are planned for both of these sections very shortly. Watch this space!

Thursday, April 16, 2009

Arcmexer: a library for reading archive files in Java

A beta version of the Arcmexer library is now available from the Javamex site. Arcmexer is a library that allows you to read the contents of various types of archive file from Java. The idea is that the library could be useful in various data conversion and data recovery operations.

At present, the following are supported:
  • ZIP files, including those with files encrypted with 128-bit AES encryption or the traditional (but insecure) PKZIP encryption scheme. Note that other encryption algorithms (including 256-bit AES) currently aren't supported but may be in the future if I'm told that people need them. (See the page on reading encrypted ZIP files.)
  • Tar files, commonly used on UNIX platforms. Tar files may come directly off a tape drive (usually via the UNIX dd command), or are created on disk via the tar command and used to transport bundles of files.
  • GZIP-compressed tar files, commonly with the ending .tar.gz or .tgz.
Other archive formats are likely to be added to future versions if people nag me about them. (That means, leave a comment on this blog entry asking for it!)

As well as reading files from the archive, a method is provided that can aid in ZIP file password recovery.

Please bear in mind that this should very much be considered a beta version. I've found it the routines it contains useful and thought they could be useful for other people. If you encounter problems, please let me know! Comments can be left on this blog post, or on the site's Java discussion forum.

Wednesday, April 15, 2009

StreamCorruptedException

The How to fix... section, looking at common Java bugs and problems, has an additional section on a couple of common causes of StreamCorruptedException. This is a serialisation error that often occurs when the streams reading and writing serialised data get "out of kilter". It's surprisingly common for this to occur accidentally.

If you haven't seen the How to fix... section yet, then it's worth a look, as it deals with a couple of other common headaches such as OutOfMemoryError and StackOverflowError. If you've got another Java bĂȘte noir that you'd like to see covered there, please let us know by posting to the Javamex forum.

Saturday, April 11, 2009

Using strings in Java

Various questions and problems commonly arise with Java Strings. Especially for C programmers, strings in Java have certain quirks, notably the fact that they're immutable (once you've created a String object, you can't change its contents; if you need a mutable string, then you generally have to use some other CharSequence class such as StringBuilder).

At the aforementioned link, I show code examples of some common string functions in Java. Surely, there'll be things I've missed out/forgotton. So please let me know if there's some thing you find yourself needing to do with strings in Java that I haven't mentioned!

New section: Java cryptography

Several pages of the new section on Java Cryptography are now available for review and general criticism. Topics currently covered include:
Topics not currently covered, but planned for the near future include:
  • comparison of block cipher algorithms: performance and security considerations;
  • secure hash functions and authentication;
  • cryptographic protocols;
  • secure random number generation (note that the Java SecureRandom class is currently discussed in the site's section on random numbers in Java);
  • digital signatures.
As usual, comments about the currently published sections are always welcome, as are suggestions for future additions to the cryptography section or to any other section of the web site. Please leave comments either on this blog or in the Java cryptography section of the Javamex discussion forum.

Update 12/04/09: The section now contains some information on secure hash functions in Java, plus a comparison of encryption algorithms.
19/04/09 As discussed in a separate blog entry, some information on password-based encryption is now included.

Enjoy...!

Tuesday, March 10, 2009

PIFTS.EXE: Symantec finally own up

So, the world can rest easy in their beds. A message tucked away on Symantec's forums-- the same forums from which all communication about the issue was previously banned-- in which they have finally owned up to what happened:
  • they released a patch to do some boring things that any old patch might have done
  • but they released the patch unsigned, causing it to hit the firewall when it otherwise wouldn't have done
  • because some of the posts on the Symantec forum were judged to be abusive, all posts were pulled down.
Whilst this seems to be an astonishing example of customer relations, and has brought the world's attention to the kind of behaviour that such patches may be conducting on a routine basis, it does at least appear that the Feds are not about to plunder our computers for illicit chocolate chip cookie recipes. We were spared... this time.

(And yes, I did back up my recipe collection... just in case.)

What is the mysterious PIFTS.EXE?

Update 10/03/09 11:50am Luckily, it does appear that PIFTS.EXE is just a storm in a teacup. Symantec still appear to be saying about as much as the Queen did after Diana died.

So after a mysterious PIFTS.EXE program hits the Kaspersky firewall asking to connect out from one of our machines, I hit the Internet to find that nobody knows, but the world is wondering. According to Google Trends, it has been hovering between the 15th and 25th most frequent search for the last couple of hours. Various theories about PIFTS.EXE appear to be emerging: was it some component of Norton Antivirus that went wrong? Is it some mad terrorist plot to wipe the Internet off the face of the earth and thus prevent people from finding out about why Lil' Kim went to jail?

Update 10/03/09 Reading things so far, and just possibly maybe having had a quick look at a disassembly of the exe in IDAPro, the consensus seems to be that the file is essentially harmless, but was an attempt by Symantec to gather some statistics from users' machines about installed antivirus components. A user posting to Reddit also suggests that a code review of PIFTS.EXE does not reveal anything too nafarious, and that automated code analysers such as ThreatExpert don't pull anything up either (then again, a "well written" virus wouldn't do, would it?).

Sometimes in cases like this, it's not so much whether there is anything wrong but whether there appears to be. Pulling down all forum posts about the file when there is clear user anxiety without then making an official statement doesn't making it look as though everything's hunky-dory...

Thursday, January 22, 2009

Java Certification Tips

A new page of Java Certification Tips gives a "cribsheet" of some of the "syntactic niggles" and other commonly overlooked features of Java that could trip up an otherwise moderately experienced programmer when taking the Sun Certified Java Programmer (SCJP) exam. Did you know, for example that:

  • this is invalid Java syntax?
    float f = 2.5;
  • goto is a Java keyword?
  • this line does not assign the value 13 to the variable?:
    int i = 013;
See the page of tips for more information on these and other niggles.

Wednesday, January 21, 2009

Regular expression tutorial: new example

The Java regular expression tutorial section has been updated with a new example of using regular expressions. In this example, we look at how to perform what is sometimes referred to as HTML scraping: pulling out data from an HTML page (or indeed XML document).

The example is located here: HTML scraping with Java regular expressions. As explained in this tutorial, regular expressions are a good candidate for data scraping because they are flexible. Various libraries exist to parse an HTML or XML document and return an object representation of that document. But such libraries are often "too fussy": many if not most web pages actually do not conform strictly to HTML standards. Similarly, many XML parsing libraries are too fussy for real-life RSS feeds, which are often malformed, strictly speaking. Using regular expressions cuts through the fuss.

When should you scrape web pages?

Note that the article focusses on how technically to scrape HTML pages in Java. It doesn't deal with the "political" issue of whether the site in question wants its content scraped in the first place. In general, it is good practice to do the following:

  • find out if he web site in question has an API to provide the data in a more convenient format (and in a format that it would prefer to provide it to you in!)
  • be open about what you're doing: if the web site administrator things you're trying to "hide" something, they may block your IP address
  • respect server resources: if you are retrieving multiple pages, consider putting a thread sleep between fetches