High quality hash

No matter how well you get to know OS X, its comprehensive capabilities never cease to surprise and delight.

I recently wanted to carry out a survey involving sensitive personal information. Under the Data Protection Act 1998, one robust approach is to render those data anonymous before performing any analysis on them, so I was looking for a secure way of achieving that.

I could of course simply number each subject, and only store that number, instead of data fields that could be used to identify them. Unfortunately this was not a suitable solution, as I wanted to be able to add more data about both new and existing subjects in the future.

Creating an index file, which would allow me to look up the unique subject identifier for a given subject name, was not a good answer either, as it would enable reverse lookup: anyone using the anonymised database could then use the index file to identify whose data were whose, defeating the purpose of anonymisation.

So I needed a one-way method of generating a unique key for each subject name, one that would not permit the reverse process of working out the name for a given key. This is known as a ‘one-way hash key’, and is widely used to fingerprint files (such as OS X security patches) whose integrity needs to be checked. You will see these quoted by Apple on an update page – for example, for the 10.4.2 update it states “SHA1= MacOSXUpdate10.4.2.dmg= 5a11375c29f1f656061189b9467cf9291153de46”. Once you had downloaded that file, you could have used the Terminal command
openssl dgst -sha1 MacOSXUpdate10.4.2.dmg
to verify that its SHA-1 hash key is indeed as stated, and thus that the file is ‘good’. This is more reliable than a CRC32 checksum, which catches accidental corruption but is easy to forge, so it gives you greater assurance that no-one has tampered with the update. It is described here.
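Comparing a 40-character digest by eye is error-prone, so the check can be wrapped in a small shell function. What follows is a minimal sketch rather than anything Apple supplies: the function names `sha1_of_file` and `verify_sha1` are made up for illustration, and it assumes that `openssl` and `awk` are on your path.

```shell
#!/bin/sh

# sha1_of_file FILE: print only the hex SHA-1 digest of FILE.
# The label that openssl prints before the digest differs between versions,
# so take the last whitespace-separated field of its output.
sha1_of_file() {
    openssl dgst -sha1 "$1" | awk '{print $NF}'
}

# verify_sha1 FILE EXPECTED: succeed only if FILE's digest matches EXPECTED.
# (Hypothetical helper; EXPECTED is the digest quoted on the download page.)
verify_sha1() {
    actual=$(sha1_of_file "$1")
    # Accept the published digest in upper or lower case.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}
```

With those functions loaded, `verify_sha1 MacOSXUpdate10.4.2.dmg 5a11375c29f1f656061189b9467cf9291153de46` would report whether the downloaded file matches the quoted hash key.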

So to perform effective and one-way anonymisation, all I need do is compute the SHA-1 hash key for a subject’s name, and then refer to those data in my database by the hash key. If I ever want to add more data for that subject, I can readily work out their hash key again, but I cannot browse the database and use a hash key to work out the name of that subject. My anonymisation step is therefore secure in that it permits only one-way lookup: once the data are anonymised, a hash key cannot be run backwards to recover the name.
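As a rough sketch of that step, the same `openssl` tool can hash a name passed on standard input. The function name `subject_key` and the subject “Jane Example” are invented for illustration; the one detail that matters is using `printf '%s'` rather than `echo`, so that a trailing newline does not become part of the hashed text.

```shell
#!/bin/sh

# subject_key NAME: print the SHA-1 hash key for a subject's name.
# printf '%s' writes the name with no trailing newline, so the same name
# gives the same key whenever it is hashed again.
# awk keeps only the digest, whatever label openssl prints before it.
subject_key() {
    printf '%s' "$1" | openssl dgst -sha1 | awk '{print $NF}'
}

# "Jane Example" is a made-up subject name.
key=$(subject_key "Jane Example")
echo "$key"
```

Running it again for the same name prints the same 40-character key, which is what allows more data to be filed against an existing subject later.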

I don’t have to buy any additional software tools, or develop my own code, as OpenSSL, built into OS X, contains a proven tool for generating SHA-1 hash keys that meets the requirements of the US standard FIPS PUB 180-4. Even the eminent cryptographer Bruce Schneier has agreed that for this sort of application SHA-1 remains acceptably robust (see here), although more secure alternatives are needed for the likes of SSL encryption.

The final key element in this solution is to ensure that some of the information encoded through the hash key cannot be worked out or guessed. Otherwise anyone holding a list of likely names could compute the hash key for each and look for matches in the database. Although I will use a word taken from a protected record, if you use this to anonymise a Web-based questionnaire, you might ask subjects to give their town of birth, or mother’s maiden name. Provided that the subject gives the same information each time, and thus the data input to the hash key computation remains the same, you can always identify that subject again.
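One way this might look in practice is to hash the name together with the extra word. The sketch below is illustrative only: the function names `normalise` and `salted_key`, the `|` separator, and the tidying of case and stray spaces are all assumptions of this example, added so that small differences in typing do not change the key.

```shell
#!/bin/sh

# normalise TEXT: lower-case TEXT and trim leading and trailing whitespace.
# (An assumption of this sketch, so that " Leeds" and "leeds" give one key.)
normalise() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# salted_key NAME SECRET: print the SHA-1 hash key of NAME combined with
# a second item that an outsider should not be able to guess.
salted_key() {
    name=$(normalise "$1")
    secret=$(normalise "$2")
    # The "|" separator keeps ("ab", "c") distinct from ("a", "bc").
    printf '%s|%s' "$name" "$secret" | openssl dgst -sha1 | awk '{print $NF}'
}
```

For example, `salted_key "Jane Example" "Leeds"` (both values made up) returns the same key each time the subject supplies the same two answers, and a different key if either answer changes.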

The tools that you need to do this are already on your Mac – it is just a matter of realising their potential, in contrast to Windows (even 8, apparently) or Mac OS 9, which were woefully incomplete.

Updated from the original, which was first published in MacUser volume 21 issue 18, 2005.