There are many occasions when we need a ‘fingerprint’ of a file or other data. Fingerprints can be used to check the integrity of a file, download or message, to verify the authenticity of something more precious, such as a passphrase, or even for tasks such as testing whether two chunks of text are the same. Depending on the method used and its behaviour, these are variously known as checksums, CRCs and hashes.
At their heart is a common task: to reduce a variable amount of data to a single fixed-length number in a way that the number is distinctive of the data. That number is the checksum or hash of the data.
A simple example might be to add together all the bytes in the data to make a single 32-bit integer, ignoring any carries in the addition. That 32-bit integer is then the checksum. Examples of what you could do with a checksum or hash include:
- check whether two different chunks of data are identical, by comparing their checksums;
- check that copies of the data are identical, by comparing their checksums;
- check whether a file has been corrupted when downloaded, by comparing it against its known checksum;
- index large collections of data using the checksum or hash instead of the data itself.
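The simple additive checksum described above can be sketched in a few lines of Python. This is only an illustration: the function name `additive_checksum` and the sample strings are made up for the example, and real checksums use more elaborate arithmetic.

```python
def additive_checksum(data: bytes) -> int:
    """Sum every byte, keeping only the low 32 bits (carries are discarded)."""
    total = 0
    for byte in data:
        total = (total + byte) & 0xFFFFFFFF
    return total

original = b"The quick brown fox"
copy = bytes(original)
damaged = b"The quick brown fix"  # one letter changed

# Identical data always gives an identical checksum.
print(additive_checksum(original) == additive_checksum(copy))
# Changing a single letter changes the sum.
print(additive_checksum(original) == additive_checksum(damaged))
```

A sum like this is cheap to compute, but it cannot notice bytes that have merely been reordered, which is one reason practical checksums do more than add.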
To do this, checksums and hashes must be quick to calculate, and different data should almost always produce different checksums. When two different chunks of data result in the same checksum or hash, that’s known as a collision, and is every bit as bad as that sounds.
Clearly, the longer the checksum, the lower the risk of collisions. A single-byte checksum would be very quick to calculate and economical on storage, but with only 256 different values, collisions would be too common for it to be of any practical use. One significant factor in the likelihood of collisions is whether all the different values of the checksum/hash are approximately equally probable.
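To put rough numbers on how length affects collisions, here is a minimal sketch using the standard birthday approximation. It assumes every checksum value is equally likely, as discussed above; the function name `collision_probability` and the figure of 100,000 items are only illustrative.

```python
import math

def collision_probability(items: int, bits: int) -> float:
    """Approximate chance that at least two of `items` random values collide
    when each value is `bits` long (the birthday approximation)."""
    possible = 2 ** bits
    exponent = items * (items - 1) / (2 * possible)
    return -math.expm1(-exponent)  # 1 - e^(-exponent), accurate for tiny values

# Compare a 1-byte checksum with 32-, 160- and 256-bit values.
for bits in (8, 32, 160, 256):
    print(bits, collision_probability(100_000, bits))
```

Under these assumptions, about 20 items are enough to make a collision more likely than not with a single-byte checksum, while a 32-bit checksum reaches the same point at roughly 77,000 items.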
Basic non-cryptographic hashes are widely used throughout computing. For example, when comparing two text strings to see if they’re the same, it’s often far quicker to compare their hashes rather than step through comparing every character in the strings. In Swift, for instance, many data types are hashable, making hashes available for such purposes.
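As an illustration of a basic non-cryptographic hash, the sketch below uses 32-bit FNV-1a, a well-known simple algorithm. It's chosen here only because it's short, and isn't claimed to be what Swift or any other particular language uses internally. The helper `same_text` is hypothetical, and shows the idea of comparing stored hashes before falling back to a full comparison.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    value = FNV_OFFSET_BASIS
    for byte in data:
        value ^= byte
        value = (value * FNV_PRIME) & 0xFFFFFFFF
    return value

def same_text(a: str, hash_a: int, b: str, hash_b: int) -> bool:
    """Compare stored hashes first; only equal hashes need a full comparison."""
    if hash_a != hash_b:
        return False  # different hashes prove the strings differ
    return a == b     # equal hashes could still be a collision

words = ["checksum", "hash", "checksum"]
hashes = [fnv1a_32(word.encode("utf-8")) for word in words]
print(same_text(words[0], hashes[0], words[2], hashes[2]))
print(same_text(words[0], hashes[0], words[1], hashes[1]))
```

The saving comes when hashes are computed once and reused across many comparisons, since comparing two integers costs the same however long the strings are.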
A common form of checksum is the Cyclic Redundancy Check. One widely used variant, CRC-32, generates a 32-bit number as a check of the integrity of data that is transmitted or stored, and has been incorporated into many standards, including Ethernet, SATA, and various compression methods. Fletcher checksums are an alternative which can be faster to compute and perform similarly for their length.
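Both kinds of checksum are easy to try in Python. The sketch below uses `zlib.crc32` from the standard library for CRC-32, and a hand-written Fletcher-16 (the shortest member of the Fletcher family) as an illustration. The function name `fletcher16` and the sample message are assumptions made for the example.

```python
import zlib

def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255, packed into 16 bits."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1

message = b"123456789"
print(f"CRC-32:      {zlib.crc32(message):08x}")
print(f"Fletcher-16: {fletcher16(message):04x}")
```

Because the second running sum depends on the order in which bytes arrive, a Fletcher checksum can detect swapped bytes that a plain sum of bytes would miss.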
Longer and more sophisticated hash functions are designed to reduce the chance of collisions, so that they become resistant to deliberate attacks, such as a crafted file having the same hash as an innocent but important one. Those which prove most resistant are usually known as cryptographic hashes, and are often incorporated into security systems. Important properties of cryptographic hash functions include:
- The mapping from input data to hash is deterministic, so the same data always generates the same hash.
- The hash is quickly computed using current hardware.
- It’s not feasible to work out the input data for any given hash, making the mapping one-way.
- Collisions are so rare as to not occur in practice.
- Small changes in the input data should result in large changes in the hash, so amplifying any differences.
- Hash values should be fairly evenly distributed.
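Two of these properties, that the same data always gives the same hash and that small changes are amplified, can be observed directly using SHA-256 from Python’s standard `hashlib` module. This is a minimal sketch: the helper names and sample strings are illustrative, and observing these properties for one input is of course no proof that a hash function is secure.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

first = hashlib.sha256(b"hello world").digest()
again = hashlib.sha256(b"hello world").digest()
nearby = hashlib.sha256(b"hello worle").digest()  # final byte differs by one bit

print(sha256_hex(b"hello world"))
print(first == again)  # same data, same hash
print(differing_bits(first, nearby), "of", len(first) * 8, "bits differ")
```

For a well-behaved hash, flipping one input bit should change about half of the output bits on average.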
One famously failed hash is SHA-1, which produces a 160-bit (20-byte) number, often known as a message digest. In 2005, it was demonstrated that collisions could be found with far less work than brute force, bringing attacks within reach of those with sufficient computing resources, and in 2017 two different PDF files with the same SHA-1 hash were published. A predecessor to SHA-1, MD5, is even less resistant to attack, and has largely been abandoned too.
Modern cryptographic hashes still trusted include the improved and longer successors to SHA-1 in the SHA-2 family, SHA-256, SHA-384 and SHA-512, and BLAKE3, which is currently one of the best-performing.
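The difference in digest length between these hashes is easy to see with Python’s standard `hashlib` module, as in the sketch below. The table `ALGORITHMS` and helper `digest_bits` are names made up for this example. BLAKE3 isn’t part of the Python standard library, so it’s left out here rather than guessing at a third-party interface.

```python
import hashlib

ALGORITHMS = {
    "SHA-1": hashlib.sha1,      # shown for comparison only; no longer trusted
    "SHA-256": hashlib.sha256,
    "SHA-384": hashlib.sha384,
    "SHA-512": hashlib.sha512,
}

def digest_bits(label: str, data: bytes = b"") -> int:
    """Length in bits of the digest produced by the named algorithm."""
    return len(ALGORITHMS[label](data).digest()) * 8

for label in ALGORITHMS:
    print(f"{label:>8}: {digest_bits(label, b'fingerprint me')} bits")
```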
macOS has built-in support for cryptographic hashes, and uses them extensively in many of its security features. Notable examples include code signatures, which include ‘cdhashes’ of the protected parts of each app, bundle, etc. These are relatively independent of signing certificates, and are the underlying reason for M1 Macs needing all native executable code to be signed. Cryptographic hashes are also used to verify the integrity of the Sealed System Volume in Big Sur, where they’re assembled into a hierarchy like a Merkle tree.
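The general idea of a Merkle tree can be sketched as follows. This is a generic illustration and not the format macOS uses for the Sealed System Volume: the block contents, the choice of SHA-256, and the rule for handling an odd number of nodes are all assumptions made for the example.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash each block, then repeatedly hash pairs of hashes until one remains."""
    if not blocks:
        raise ValueError("at least one block is required")
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: an odd node is paired with itself
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block 0", b"block 1", b"block 2", b"block 3"]
root = merkle_root(blocks)

blocks[2] = b"block 2, tampered"
print(root != merkle_root(blocks))  # a change in any block alters the root
```

The attraction of the hierarchy is that a single top-level hash vouches for everything beneath it, while any one block can be checked using only the hashes along its path to the root.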
More generally, cryptographic hashes are used in message authentication codes (MACs) to verify data integrity in TLS (formerly SSL), and they’re often used in the process of pseudonymisation to protect the identities of individuals who take part in research projects.
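A hash-based message authentication code (HMAC) is one common way to build a MAC from a cryptographic hash, and Python’s standard `hmac` module provides it. The sketch below isn’t how TLS itself is implemented. The function names, key and messages are illustrative, and using a keyed hash for pseudonyms is just one possible approach.

```python
import hashlib
import hmac

def make_tag(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 tag that only holders of `key` can produce."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)

def pseudonym(key: bytes, identifier: str) -> str:
    """Keyed hash of an identifier: the same person always maps to the same token."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"example secret key"  # illustrative only; real keys should be random
tag = make_tag(key, b"transfer 10 to Alice")
print(verify(key, b"transfer 10 to Alice", tag))
print(verify(key, b"transfer 99 to Alice", tag))
print(pseudonym(key, "participant-0042"))
```

Mixing a secret key into the hash means that someone without the key can neither forge a valid tag nor rebuild the pseudonyms by hashing a list of likely identifiers.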
My own suite of apps for verifying file integrity, Dintch, Fintch and cintch, use SHA-256 hashes provided by Common Crypto in macOS.
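The general technique of checking a file against a known SHA-256 hash can be sketched in Python as below. This isn’t the code used in those apps, which rely on Common Crypto: the function names and chunk size are assumptions for the example, which reads the file in pieces so that large files needn’t fit in memory.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hash of a file as hex, reading it a chunk at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def is_intact(path: str, expected_hex: str) -> bool:
    """Compare a file's current hash against one recorded earlier."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

In use, the hash would be recorded while the file is known to be good, then `is_intact` called with that stored value whenever the file needs to be checked.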