File Integrity 2: Which digest?

In the first article in this series, I looked at why we need to be able to check the integrity of files, and what macOS doesn’t provide to do this. If we’re going to use software to meet this need, then we need to decide what that software should do, which is the subject of this article.

Although in principle there’s no reason that third-party software couldn’t implement error-correcting codes (ECC), in practice adding this on is too complex, at least for me to attempt. What we can do, without trying to piggyback anything onto APFS, is provide a robust means of checking the integrity of the data within files. To do that requires a software-generated ‘signature’ which is unique to each file.

One traditional approach, which you’ll still see used in some places, is to calculate a checksum, which is also very quick. The most popular of these is the Cyclic Redundancy Check, or CRC. Wikipedia has an excellent article which explains in detail exactly how these are computed. CRCs are suitable for some communications protocols, such as Ethernet, and are used within some file types, but aren’t robust enough for a general file integrity check.
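To give a feel for how little work a checksum involves, here is a minimal sketch in Python using the CRC32 function in its standard zlib module. The function names crc32_hex and crc32_hex_chunked are my own, chosen for illustration, and aren't part of any macOS tool.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the CRC32 of data as an 8-digit hexadecimal string."""
    # Mask to 32 bits so the result is always treated as unsigned.
    return f"0x{zlib.crc32(data) & 0xFFFFFFFF:08X}"


def crc32_hex_chunked(chunks) -> str:
    """Compute the same CRC32 incrementally, one chunk at a time."""
    crc = 0
    for chunk in chunks:
        # Passing the running value continues the calculation.
        crc = zlib.crc32(chunk, crc)
    return f"0x{crc & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    print(crc32_hex(b"123456789"))
    print(crc32_hex_chunked([b"1234", b"56789"]))
```

Whatever the size of the input, the result is only ever 32 bits long, which is why a CRC can catch accidental transmission errors but can't serve as a reliable fingerprint for a file.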

For the risks which I listed in the first article, we really need a cryptographic hash function, which can be generated for the entire file, whether it’s a couple of KB or hundreds of GB. An important measure of the suitability of a hash function is how easy it is to generate a ‘collision’, that is, two different files which have the same hash value. For example, for the old (and now largely disused) hash function MD5, your Mac can find collisions in just a few seconds.
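Hashing a file of any size doesn’t require reading it all into memory at once. As a rough sketch, assuming Python and its standard hashlib module, the file can be fed to the hash function in chunks. The function name sha256_of_file and the 1 MB chunk size are my own choices for illustration.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 digest of a file's contents as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object,
        # so memory use stays constant however large the file is.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the hash state is updated chunk by chunk, the result is the same as hashing the whole file in one go.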

Although not as resistant to attack as some methods, the 256-bit version of SHA-2, commonly known as SHA-256 or SHA256, is now widely used for this type of purpose. It’s fairly resistant to collisions, reasonably quick to calculate even on large files, and generates a 32-byte number, known as the digest. A more robust and potentially quicker option is BLAKE3, which was only released in January 2020.

One significant advantage of SHA256 is that it’s widely used in macOS, offered in Apple’s Common Crypto library prior to 10.15, and now in Catalina’s CryptoKit. Those two versions are compatible, in that they will both generate the same digest from the same input file. CryptoKit provides an interface to the SHA256 hashing function which is more secure, and less likely to result in vulnerabilities. It should also perform well.

Cryptographic hash functions like SHA256 and BLAKE3 have another valuable property: they amplify even the smallest change in the data used to generate a digest. Changing a single x to a y in the middle of a 2 MB text file, for example, changes its digest from
0x437c23ba127389351e7e8687f71ced911875d1c133e0f1ee3501ae8b2c800b03 to
0xd2b49d470bcf215036f209ee127176ef55141ed6fb76ba82294c92d93e6bf278. This makes them particularly suitable for use in the detection of ‘bit rot’, which may result in such small changes over long periods of time.
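This amplification can be measured by counting how many bits differ between the two digests. The following is a minimal sketch in Python, not the test I ran above: make_sample builds a stand-in for a text file by repeating the letter a, and all the function names are hypothetical.

```python
import hashlib


def sha256_int(data: bytes) -> int:
    """Return the SHA-256 digest of data as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bits which differ between the digests of a and b."""
    return bin(sha256_int(a) ^ sha256_int(b)).count("1")


def make_sample(middle: bytes, size: int = 2_000_000) -> bytes:
    """Build size bytes of filler with the given bytes placed in the middle."""
    half = size // 2
    return b"a" * half + middle + b"a" * (size - half - len(middle))


if __name__ == "__main__":
    changed = differing_bits(make_sample(b"x"), make_sample(b"y"))
    print(f"{changed} of 256 digest bits differ")
```

For a good hash function, a one-character change like this would be expected to flip around half of the 256 bits in the digest.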

To give you an idea of the difference between different types of checksum or hash, an example file generates the following digest results:
CRC16 0x53D4
CRC32 0xD0358490
MD5 0xca9cf564fcc3a2950bdf70d9286d8bb0
SHA256 0x437c23ba127389351e7e8687f71ced911875d1c133e0f1ee3501ae8b2c800b03

You can look at this in detail using Hash and similar apps available from the App Store, which you can use to perform manual integrity checks. As you can see from the length of the digests, finding two different files with identical CRC16 values is going to be relatively easy, as there are only 65,536 possible values. Finding two with identical SHA256 digests is almost impossible.
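To illustrate how easy a 16-bit collision is, this sketch in Python simply tries one input after another until two share the same value. It uses binascii.crc_hqx from the standard library, which implements the CRC-CCITT polynomial. That’s an assumption on my part, as it may not be the same CRC16 variant as the one used for the example file above, and the generated names like file-0 are purely illustrative.

```python
import binascii


def find_crc16_collision():
    """Return two different inputs, and the 16-bit CRC which they share."""
    seen = {}  # maps each CRC value to the first input which produced it
    counter = 0
    while True:
        data = f"file-{counter}".encode()
        crc = binascii.crc_hqx(data, 0)
        if crc in seen:
            return seen[crc], data, crc
        seen[crc] = data
        counter += 1


if __name__ == "__main__":
    first, second, crc = find_crc16_collision()
    print(first, second, f"0x{crc:04X}")
```

With only 65,536 possible values, this search must succeed within 65,537 attempts. The same brute-force approach against a 256-bit digest is completely out of reach.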

Another important question is whether the digest should be calculated for just the data stored in the file’s data fork, or whether it should include metadata, in particular the file’s extended attributes (xattrs). Although in general data stored in xattrs isn’t part of the content of the file and is often ephemeral in nature, it’s not simple to determine whether any given xattr merits inclusion in a digest. Some xattrs, such as the quarantine flag, can change frequently in any case. It’s therefore better to calculate the digest for just the data fork of each file.

Calculating a digest for a file only fingerprints that file at an instant in time. One question that I haven’t yet resolved is whether each digest needs to be accompanied by the timestamp of its creation, much in the way that code signatures in macOS are. This is potentially useful in deciding what to do with a file which fails integrity checks: knowing exactly when its digest was created could help the user locate a backup copy which corresponded to the version which failed checking. However, the timestamp can’t tell the user when any corruption occurred, only a moment in time before that happened. Currently, I’m not convinced that a timestamp would serve any useful purpose.

So, if we’re going to use a file integrity checker, it should calculate the SHA256 digest on the data fork, which then needs to be associated somehow with that file, but probably not with a timestamp.
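Those conclusions can be put together in a rough sketch, here in Python. Reading a file’s contents with open() returns only its data, not its extended attributes, so the digest covers just the data fork. The names data_digest and verify are hypothetical, and how the saved digest is stored alongside the file is deliberately left open.

```python
import hashlib


def data_digest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 digest of the file's data only, as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, saved_digest: str) -> bool:
    """Return True if the file's data still matches the saved digest."""
    return data_digest(path) == saved_digest.lower()
```

A checker would call data_digest once to record the digest, then call verify at any later time. A result of False only shows that the data has changed since the digest was recorded, not when or how.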

In the next article, I will consider how to associate the digest and file.