How should we check the integrity of important files?

Because APFS doesn’t offer any feature to verify the integrity of the data in our important files, the only option is to design an app to handle that. As I’ve already done so, this article steps through my rationale and design decisions.

Choosing the digest

There’s a wide range of methods for computing message digests, single values that can be used as a ‘fingerprint’ or ‘summary’ of the contents of a complete file. These fall into two main categories: checksums and cryptographic hashes. Although checksums can be much faster to calculate, they can also suffer some surprising shortcomings. For example, according to Wikipedia:
“The Fletcher checksum cannot distinguish between blocks of all 0 bits and blocks of all 1 bits. For example, if a 16-bit block in the data word changes from 0x0000 to 0xFFFF, the Fletcher-32 checksum remains the same. This also means a sequence of all 00 bytes has the same checksum as a sequence (of the same size) of all FF bytes.”
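
To see that in practice, here’s a minimal Fletcher-32 sketch in Swift; the function and test values are mine, for illustration only, and both blocks of words produce the same checksum:

```swift
import Foundation

// Minimal Fletcher-32 over 16-bit words, for illustration only.
func fletcher32(_ words: [UInt16]) -> UInt32 {
    var sum1: UInt32 = 0
    var sum2: UInt32 = 0
    for word in words {
        sum1 = (sum1 + UInt32(word)) % 0xFFFF
        sum2 = (sum2 + sum1) % 0xFFFF
    }
    return (sum2 << 16) | sum1
}

// 0x0000 and 0xFFFF are congruent modulo 0xFFFF, so these collide.
let zeros = [UInt16](repeating: 0x0000, count: 8)
let ones = [UInt16](repeating: 0xFFFF, count: 8)
print(String(format: "%08x", fletcher32(zeros))) // 00000000
print(String(format: "%08x", fletcher32(ones)))  // 00000000
```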

Properties of cryptographic hashes have been explored more extensively, and those based on the SHA-2 standards are generally accepted as being thoroughly reliable. Apple provides three implementations, SHA-256, SHA-384 and SHA-512, in its Common Crypto (macOS 10.14 and earlier) and CryptoKit (macOS 10.15 and later) APIs.
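
Computing a one-shot digest with CryptoKit takes only a few lines. This is a minimal sketch, not Dintch’s actual code:

```swift
import CryptoKit   // macOS 10.15 or later
import Foundation

// One-shot SHA-256 digest of data held in memory.
let digest = SHA256.hash(data: Data("hello".utf8))

// The digest is 32 bytes; render it as hex for display.
print(digest.map { String(format: "%02x", $0) }.joined())
```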

Two characteristics are of particular importance here: speed and the size of the resulting digest. To compare these, tests were performed using files of 1 and 10 GB size on the M1 Max chip in a Mac Studio. SHA-256 in CryptoKit consistently delivered a hashing speed of 2.1 GB/s, for a digest length of 32 bytes; SHA-512 delivered a significantly lower speed of only 1.3 GB/s and a larger digest of 64 bytes. As SHA-384 uses the same algorithm as SHA-512, it too delivered 1.3 GB/s.

While modern substitutes for SHA-256 might deliver higher speeds, lack of support in both Common Crypto and CryptoKit is a deterrent to their adoption in macOS. I therefore decided to use SHA-256 throughout.

Storing hashes

As a general principle, important metadata such as message digests of files should be associated with the files themselves, in the form of an extended attribute. Storing hashes in a separate directory manifest makes them dependent on the contents of directories remaining unchanged, which might work well on read-only storage media, but isn’t suitable when directory contents can be changed and verified files moved around.

To make the extended attribute of type co.eclecticlight.dintch.hash as persistent as possible, the flag #S should be attached. That’s done by appending #S to the attribute’s name, marking it as syncable, so that it should be preserved when the file is copied or synced.
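
Here’s a sketch of reading and writing that attribute using the BSD getxattr and setxattr calls; the helper names are mine, and error handling is kept to a minimum:

```swift
import Foundation

// The #S flag rides on the end of the attribute's name.
let attrName = "co.eclecticlight.dintch.hash#S"

// Write a digest to the file's extended attribute.
func writeHash(_ digest: Data, to path: String) -> Bool {
    let result = digest.withUnsafeBytes {
        setxattr(path, attrName, $0.baseAddress, $0.count, 0, 0)
    }
    return result == 0
}

// Read the stored digest back, or nil if there isn't one.
func readHash(from path: String) -> Data? {
    let size = getxattr(path, attrName, nil, 0, 0, 0)
    guard size > 0 else { return nil }
    var data = Data(count: size)
    let read = data.withUnsafeMutableBytes {
        getxattr(path, attrName, $0.baseAddress, size, 0, 0)
    }
    return read == size ? data : nil
}
```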

Performance

Three functions are required as a minimum: add a freshly computed hash to each file, check whether an existing hash matches that of the file data, and update hashes so that each matches its current file data. In all three cases, the rate-limiting step is identical: the computation of the SHA-256 hash of the file data. No significant differences were seen in the performance of those three features in Dintch when all files were stored on the internal SSD of the Mac Studio.

Tuning the size of the read buffer makes relatively little difference to overall performance: when checking a single 10 GB file, the time required varied little with buffer sizes from 512 KB to 2 MB.
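
Here’s a sketch of that rate-limiting loop, streaming the file through CryptoKit in fixed-size chunks; the function names are mine, readHash comes from the earlier sketch, and 1 MB sits in the middle of that buffer range:

```swift
import CryptoKit
import Foundation

// Stream a file through SHA-256 in fixed-size chunks.
func sha256(ofFileAt url: URL, bufferSize: Int = 1_048_576) throws -> Data {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: bufferSize), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return Data(hasher.finalize())
}

// Check: does the stored digest still match the file's data?
func checkFile(at url: URL) throws -> Bool {
    guard let stored = readHash(from: url.path) else { return false }
    return try sha256(ofFileAt: url) == stored
}
```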

Overall processing speeds, including all operations for each of the three features, when run at either of the faster speed settings in Dintch, were:

  • single 10 GB file – 2.1 GB/s
  • five 1 GB files – 2.0 GB/s
  • 15 files totalling 10.7 GB – 2.0 GB/s
  • 121 files averaging 263 KB each – 114 MB/s, or 434 files/s.

When run exclusively on E cores, at the slowest speed setting, speeds fell to 0.6 GB/s for larger files, and 22 MB/s (85 files/s) for the last, small-file test group.
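
There’s no direct way to request the E cores; instead, dispatching work at a low quality of service such as .background normally confines it to them on Apple silicon. A minimal sketch, assuming a filesToCheck array and the checkFile function above, with an illustrative queue label:

```swift
import Foundation

// Hypothetical list of files awaiting checking.
let filesToCheck: [URL] = []

// Work dispatched at .background QoS normally runs on the E cores,
// trading speed for minimal impact on the rest of the system.
let queue = DispatchQueue(label: "co.eclecticlight.dintch.check", qos: .background)

queue.async {
    for url in filesToCheck {
        _ = try? checkFile(at: url)   // checkFile from the sketch above
    }
}
```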

Error correction

Detecting errors by checking message digests is important and useful, but only part of the solution. Should a discrepancy arise between the SHA-256 hash of a file and its previous value, wouldn’t it be more helpful if the error could be corrected too?

Unfortunately, although the problems don’t appear too dissimilar, error-correcting codes are considerably more complex, and require substantial amounts of additional storage if they are to be effective against anything more than the most trivial of errors. The great majority of work on error-correction has concentrated on streams of data transmitted in radio signals, or over networks, and little has been devoted to files in storage.

The worst case for any file is total loss, either because all the data has been deleted, or it has been damaged throughout. In signal transmission, that situation would normally be handled by requesting retransmission, an option not open when the only intact copy of a file has been lost or destroyed.

Conventional redundancy techniques for files store multiple copies, for example in RAID 1 mirrors. These are inefficient, as each redundant copy requires the full size of the original. A more efficient alternative for files that don’t change frequently is to store redundant copies using lossless compression. Some forms of compression readily available in macOS, such as Apple Archive, preserve extended attributes, so can accommodate message digests, use multiple cores efficiently, and preserve special formats such as sparse files. They appear best-suited to such redundant storage schemes.
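
As a sketch of how such a redundant copy might be written using the AppleArchive framework, following the pattern in Apple’s documentation; the function name and error choice are mine, and I’m assuming the default key set carries the metadata, including extended attributes, needed to restore each file:

```swift
import AppleArchive
import Foundation
import System

// Compress a folder of verified files into an Apple Archive (.aar).
func archiveDirectory(at source: FilePath, to destination: FilePath) throws {
    guard let fileStream = ArchiveByteStream.fileStream(
            path: destination,
            mode: .writeOnly,
            options: [.create, .truncate],
            permissions: FilePermissions(rawValue: 0o644)),
          let compressStream = ArchiveByteStream.compressionStream(
            using: .lzfse,
            writingTo: fileStream),
          let encodeStream = ArchiveStream.encodeStream(
            writingTo: compressStream) else {
        throw CocoaError(.fileWriteUnknown)
    }
    defer {
        try? encodeStream.close()
        try? compressStream.close()
        try? fileStream.close()
    }
    // Walk the directory and encode each entry into the archive.
    try encodeStream.writeDirectoryContents(
        archiveFrom: source,
        keySet: .defaultForArchive)
}
```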

Summary

  • Message digests should be SHA-256 hashes, computed using CryptoKit where available, or Common Crypto where that’s not supported.
  • Message digests should be saved as extended attributes, made persistent using the #S flag.
  • On larger files, this should see processing at a rate of about 2 GB/s on faster Apple silicon Macs.
  • An option for background processing should yield up to 0.6 GB/s on larger files.
  • Error-correction is best achieved using redundant copies, complete with message digests, and compressed using Apple Archive for greatest efficiency.