
Tuesday, September 27, 2011

Even 'random junk' may need encryption

There appear to be some interesting legal cases springing up around the fact that software developers have been taking "raw" Unique Device IDs (UDIDs) and using them to store data against a given user 'in the cloud'. A UDID essentially resembles a long 'random number' rather than directly encoding any user-identifiable information, so you may be forgiven for wondering what the issue is. The problem comes when applications have then stored user-identifiable information (such as social network data) alongside the UDID and made this information available to other applications, thus leading to a 'leaking' of user-identifiable information on the basis of this shared UDID.

The solution is in principle simple: append the UDID to an application-specific code or 'salt' and then hash this combination using a secure hash function such as SHA-256. The application-specific salt doesn't need to be a secret. But doing this means that I cannot compare a hash for user X from one application with a hash for user X from another application and deduce that they are for the same user. Again, in principle.
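In case the mechanics aren't obvious, here is a minimal Java sketch of the idea (the class and method names, and the choice of passing the salt in as a parameter, are my own invention for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UdidHasher {

    // Hashes an application-specific salt together with the device UDID using
    // SHA-256 and returns the result as a hex string. The salt need not be
    // secret; it just has to differ between applications.
    public static String anonymise(String appSalt, String udid) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest((appSalt + udid).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available in all standard Java runtimes.
            throw new RuntimeException(e);
        }
    }
}

Two applications using different salts will then derive different hashes from the same UDID, so their stored identifiers cannot simply be matched up against each other.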

However, this scheme does rely on UDIDs containing sufficient entropy. Since the number of devices of a particular model sold is maybe in the tens or at most hundreds of millions (iOS devices appear to have sold something in the order of 200 million so far), if it is possible to make a good prediction of which range of the theoretically possible UDIDs has actually been allocated, then I can simply pre-compute the salted hashes of all the allocated UDIDs for two given applications and find all the matches.
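To make the concern concrete, here is an equally hedged sketch of that pre-computation (again the names are invented, and it reuses the anonymise() method sketched above). Building one table per application salt and intersecting the recovered UDIDs would then link devices, and hence potentially users, across the two applications:

import java.util.HashMap;
import java.util.Map;

public class UdidPrecomputation {

    // Builds a reverse lookup table from salted hash back to candidate UDID.
    // An attacker who can enumerate the plausible UDIDs and who knows an
    // application's (public) salt can use such a table to recover the UDID
    // behind any stored hash.
    public static Map<String, String> buildTable(Iterable<String> candidateUdids,
                                                 String appSalt) {
        Map<String, String> table = new HashMap<String, String>();
        for (String udid : candidateUdids) {
            table.put(UdidHasher.anonymise(appSalt, udid), udid);
        }
        return table;
    }
}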

I haven't yet looked into all the details, but it appears that iOS 5 will remove the ability to read device-wide UDIDs in favour of application-specific UDIDs, presumably generated using a scheme such as the above. Although this is arguably the scheme that should have been adopted from the outset, if true, it will be interesting to see what incompatibilities this throws up.

So why didn't people think of this in the first place? I wonder if one of the issues is that, to a human, UDIDs do just look like 'random junk': it's just a string of random characters, right, so why would you bother encrypting it? It's a good example of how, when deciding when and how to employ encryption, we have to think not only about the data itself but also about the protocols and practices governing how that data is used and stored.

Saturday, January 15, 2011

New articles in the 'Java collections' section

Readers are invited to explore some new material that has been added to the Java collections section of the web site. By rights, this material would probably belong in a "Java algorithms" section, as it doesn't use the collections framework as such, but rather looks at structures closely related to those used in the standard collections classes.

In the section on Advanced use of hash codes, some new material looks first at how we can implement a strong, 64-bit hash code in Java, and then at how we can use it to create a Java hash map implementation that saves memory by storing only the keys' 64-bit hash codes rather than the keys themselves.
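For a flavour of what such a hash function can look like, here is a minimal sketch, with the caveat that the multiplier and seed below are arbitrary illustrative choices and not necessarily the scheme described in the articles. It follows the same polynomial pattern as String.hashCode(), but with 64-bit arithmetic and a large odd multiplier so that the extra bits are actually exercised:

public class StrongHash64 {

    // Arbitrary large odd multiplier and non-zero seed (illustrative only).
    private static final long MULTIPLIER = 0x27BB2EE687B0B0FDL;
    private static final long SEED = 1125899906842597L;

    // 64-bit analogue of String.hashCode(): a polynomial hash over the
    // characters of the input, computed with long arithmetic.
    public static long hash64(CharSequence s) {
        long h = SEED;
        for (int i = 0; i < s.length(); i++) {
            h = h * MULTIPLIER + s.charAt(i);
        }
        return h;
    }
}

The reason 64 bits matter is that, by the birthday approximation, the chance of any collision among n distinct keys is roughly n^2 / 2^65, which for a few million keys is negligible and is what makes it plausible to keep just the hash in place of the key.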

A further section on Bloom filters builds on this strong hash function, allowing a set of objects to be represented very compactly in memory in return for a certain "false positive rate". This tradeoff is useful in situations where we want to "weed out" items that require further, resource-intensive processing from those that don't (for example, when wanting to reduce database hits), but where it doesn't matter too much if a small percentage of items "slip through the net". We look more closely at how the initialisation parameters of the Bloom filter can be configured to tune the false positive rate in return for more or less memory space being required.
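As an illustrative sketch of the data structure itself, rather than the implementation discussed in the article, a minimal Bloom filter built on the 64-bit hash above might look as follows. It derives the required bit positions for each item from a single 64-bit hash by splitting it into two 32-bit halves and combining them linearly (the standard "double hashing" trick). For n stored items, m bits and k hash functions the false positive rate is approximately (1 - e^(-kn/m))^k, and choosing k close to (m/n) ln 2 minimises it, which is essentially the tuning trade-off described above:

import java.util.BitSet;

public class SimpleBloomFilter {

    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derives the i-th bit position from a single 64-bit hash of the item.
    private int position(long hash64, int i) {
        int h1 = (int) hash64;
        int h2 = (int) (hash64 >>> 32);
        return ((h1 + i * h2) & 0x7FFFFFFF) % numBits;
    }

    public void add(CharSequence item) {
        long h = StrongHash64.hash64(item);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(h, i));
        }
    }

    // Returns false if the item was definitely never added; true means
    // "probably added", with a false positive rate governed by numBits
    // and numHashes relative to the number of items added.
    public boolean mightContain(CharSequence item) {
        long h = StrongHash64.hash64(item);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(h, i))) {
                return false;
            }
        }
        return true;
    }
}

As a rule of thumb, around 10 bits per stored item with 7 hash functions gives a false positive rate of roughly 1%.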