For a moderate number of possible values, one potential implementation that I have written about before involves using a bloom filter (which is essentially an "approximate set").

But a more memory efficient solution comes in the form of the HyperLogLog algorithm. Effectively, this algorithm allows you to estimate the number of distinct values seen by gathering broad statistics on the hash codes seen so far. To use an easy-to-understand analogy presented in the article, if I told you that I had tossed a coin

*n*times and got a maximum run of 10 heads in a row, you would estimate that I had tossed the coin fewer times than if I told you that I had got a maximum run of 20 heads in a row, and you could use some statistics to estimate

*n*. The same essentially goes if I were to say, "out of all the hash codes of values seen so far, the maximum run of 0s in the lower bits was

*x*", and I can take the average of various statistics of this type to hone the estimate.

It may not be used often for all applications, but if you are in the business of "counting events on big data", then the HyperLogLog algorithm appears to be a very worthwhile weapon to have in your arsenal. I'm sure I'll be adding it to mine.

## No comments:

Post a Comment