Wednesday, October 8, 2014

The HyperLogLog structure: estimating the number of distinct values on "Big Data" with minimal memory

Imagine the following scenario: you want to count the number of distinct values seen in a stream of data without actually storing all of the values. (For example, you want to answer a question such as "how many distinct IP addresses have made requests to my server?" on the fly.) Conceptually, you want to place each value (in this case, each IP address) into a set and then take the size of the set. But when the range of possible values is potentially huge, keeping the entire set in memory is not viable and placing the values in a database is not practical, so you are willing to trade a little accuracy for efficiency and accept an approximate count.
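For concreteness, the exact set-based approach looks something like the following minimal Python sketch (the request data here is made up for illustration). The problem is that the set grows in proportion to the number of distinct values:

```python
# Exact distinct count using a set: correct, but memory use grows
# linearly with the number of distinct values seen.
requests = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "192.168.0.5"]

seen = set()
for ip in requests:
    seen.add(ip)

print(len(seen))  # 3 distinct addresses
```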

For a moderate number of possible values, one potential implementation that I have written about before involves using a bloom filter (which is essentially an "approximate set").
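As a sketch of how that might look (a minimal illustration of my own, not a production bloom filter; the class name, sizes, and the SHA-256-based hashing are all choices made here for the example): a value whose bits are not all set is definitely new, so we count it and add it, while occasional false positives mean the count can be a slight underestimate.

```python
import hashlib

class BloomDistinctCounter:
    """Approximate distinct counter backed by a simple bloom filter."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.bits = bytearray(num_bits // 8)  # bit array, here 128 KiB
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.count = 0

    def _positions(self, value):
        # Derive the k bit positions by slicing one SHA-256 digest.
        digest = hashlib.sha256(value.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], 'big')
            yield chunk % self.num_bits

    def add(self, value):
        positions = list(self._positions(value))
        already_seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        if not already_seen:
            self.count += 1  # no false negatives: this value is definitely new
            for p in positions:
                self.bits[p // 8] |= 1 << (p % 8)
```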

But a more memory-efficient solution comes in the form of the HyperLogLog algorithm. Effectively, this algorithm allows you to estimate the number of distinct values seen by gathering broad statistics on the hash codes seen so far. To use an easy-to-understand analogy presented in the article: if I told you that I had tossed a coin n times and got a maximum run of 10 heads in a row, you would estimate that I had tossed the coin fewer times than if I told you that I had got a maximum run of 20 heads in a row, and you could use some statistics to estimate n. The same idea applies if I say, "out of all the hash codes of values seen so far, the maximum run of 0s in the lower bits was x". HyperLogLog splits the hash codes into buckets, keeps one statistic of this type per bucket, and then averages the per-bucket statistics (using a harmonic mean) to hone the estimate.
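To make that concrete, here is a minimal Python sketch of the algorithm (my own illustration, not a tuned implementation; note that it counts leading rather than trailing zeros, following the original paper's convention, and assumes string-valued input hashed with SHA-1):

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: m = 2**b registers, each holding the
    maximum "rank" (position of the first 1-bit) seen in its share of hashes."""

    def __init__(self, b=10):
        self.b = b                    # bits of the hash used to pick a register
        self.m = 1 << b               # number of registers (1024 for b=10)
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (valid for m >= 128)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # 64-bit hash of the value
        x = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], 'big')
        j = x >> (64 - self.b)                      # top b bits pick the register
        w = x & ((1 << (64 - self.b)) - 1)          # remaining 64-b bits
        rank = (64 - self.b) - w.bit_length() + 1   # 1-based position of first 1-bit
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        # Harmonic mean of 2**register values, scaled by alpha * m**2
        estimate = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction from the paper: fall back to linear counting
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)
```

With b=10, the expected standard error is about 1.04/sqrt(m), roughly 3%, so an experiment like the following should come back within a few percent of the true count while needing only about a kilobyte of register state in a bit-packed encoding (this sketch uses a plain list for clarity):

```python
hll = HyperLogLog(b=10)
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.count())  # roughly 100,000, typically within a few percent
```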

It may not be needed in every application, but if you are in the business of "counting events on big data", then the HyperLogLog algorithm appears to be a very worthwhile weapon to have in your arsenal. I'm sure I'll be adding it to mine.
