Wednesday, July 16, 2008

Advanced use of hash codes

For many purposes, the plain old HashMap and related collections serve their purpose well and you don't need to worry too much about their gory details.

But in some cases, some more advanced hashing-related techniques can help us save memory. In a new section on advanced hashing techniques, we discuss using a BitSet to perform duplicate elimination with a certain degree of error but a significant memory saving. Knowing a little about the statistics of hash codes helps us work out the degree of error involved. For example, with a BitSet taking up around 120K, we can perform duplicate elimination on tens of thousands of strings with a 'false positive' rate of around 2%. This is a technique that I use, for example, in gathering web site statistics, and in similar cases where I want the functionality of a hash map or hash set but can accept a certain degree of error. The technique allows what would otherwise be quite a large amount of data to be 'held in memory', thus significantly reducing database hits, for example.

Have you been using hash codes in an interesting way that I haven't mentioned? If so, I'd be interested to hear.

No comments: