File Integrity 3: Where to store digests?

Having established that checking the integrity of files can be important and useful, but that macOS and its standard file systems don’t support it, I concluded that a file integrity checker should calculate the SHA256 digest of the data fork. That digest then needs to be associated somehow with the file, but probably not with a timestamp. This article considers how best to associate each file with its digest.

One of the most important considerations here is that the association must be general and as robust as possible. There’s no point in adopting a technique which only works with specific types of backup, and fails the moment the file is stored in, or passes through, iCloud or network storage.

The ideal location in which to store a file’s SHA256 digest is, of course, in its metadata, or attributes. Unfortunately, the standard attribute set for files on HFS+ and APFS volumes doesn’t include one for a digest, isn’t extensible to add an attribute for one, and there’s no existing attribute which could be repurposed for this. Furthermore, even if such an attribute were available, unless it was also supported by other file systems and cloud services, stored digests would be lost.

Three options remain: extended attributes, a separate associated file similar to the ._ mechanism used to store extended attributes in some other file systems (AppleDouble), and a hidden file in each folder/directory. I’ll take those in reverse order.

Folder-based digest lists

Maintaining a hidden file in each folder/directory in which the digests of other files in that folder are stored is a proven technique, used by diglloyd’s IntegrityChecker. So long as files remain inside that folder (in the sense that their relative path within that folder is fixed), this works well. However, it becomes complex to support moving and copying files beyond that folder.

You could in theory add a file’s digest and details to the hidden file in the destination folder, but that would only work when the move or copy is made by the integrity checker itself, and would exclude normal file operations performed using the Finder or at the command line. Folder storage of digests is therefore not suitable for many of the purposes for which we want integrity checking.
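To make that limitation concrete, here is a minimal sketch of a folder-based digest list in Python. It isn’t how IntegrityChecker or any other product works: the manifest name .digests.json, its JSON layout and the function names are all assumptions made purely for illustration.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical name for the hidden per-folder digest list.
MANIFEST_NAME = ".digests.json"


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file's data as a hex string."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MB chunks so large files needn't fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(folder: Path) -> None:
    """Record a digest for every visible file directly inside folder."""
    digests = {
        p.name: sha256_of(p)
        for p in sorted(folder.iterdir())
        if p.is_file() and not p.name.startswith(".")
    }
    (folder / MANIFEST_NAME).write_text(json.dumps(digests, indent=2))


def check_folder(folder: Path) -> dict:
    """Report each file as 'ok', 'changed', 'untracked' or 'missing'."""
    manifest = folder / MANIFEST_NAME
    known = json.loads(manifest.read_text()) if manifest.exists() else {}
    result = {}
    for p in folder.iterdir():
        if p.is_file() and not p.name.startswith("."):
            if p.name not in known:
                result[p.name] = "untracked"
            else:
                same = sha256_of(p) == known[p.name]
                result[p.name] = "ok" if same else "changed"
    # Files listed in the manifest but no longer present in the folder.
    for name in known:
        if name not in result:
            result[name] = "missing"
    return result
```

If a file is moved to another folder using the Finder, check_folder reports it as missing from the original folder and untracked in the new one, because its digest stayed behind in the old manifest.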

Flattened file formats

A family of file formats has been developed to preserve the additional metadata, including extended attributes, in macOS files. These include MacBinary, a binary format which combines data fork, extended attributes and additional metadata into a single binary file, and AppleDouble, in which the data fork is left unchanged and its metadata are combined into a shadow file of the same name prefixed with ._

MacBinary requires that its file contents are divided and reconstituted to their original form before they can be used, so it has major drawbacks here. AppleDouble preserves the data fork for immediate use, which makes reconstitution unnecessary in many cases. However, the Finder and command tools don’t automatically recognise its shadow files on HFS+ and APFS file systems, so its use wouldn’t be transparent and would risk separation of the two components.

Extended attributes

Extended attributes (xattrs) are the macOS-supported mechanism for adding arbitrary metadata such as digests to files. They are fully supported on HFS+ and APFS file systems, and some other file systems such as FAT and its relatives use AppleDouble to preserve selected xattrs. Similarly, iCloud and most network transfers, including AirDrop and supported file sharing (AFP, SMB), preserve selected xattrs, although some third-party cloud storage doesn’t. Xattrs are also stripped when burning to most removable media, unless the files are contained within a Disk Image.

Xattrs have acquired a reputation for fragility which has been largely founded on incomplete understanding of their behaviour. Xattrs have a flag system which determines whether they’re preserved or stripped during various operations. Appending #S to the name of a third-party xattr results in its being preserved during almost all file operations, including copying and duplicating, provided that the underlying file system supports xattrs.
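As a rough illustration of tagging a file in this way, the following Python sketch builds a flagged xattr name and writes the digest using the xattr command tool bundled with macOS. The name com.example.sha256 and all the function names are hypothetical, and the digest is stored as hexadecimal text for readability rather than as 32 raw bytes. The two functions which call xattr will only work on macOS.

```python
import hashlib
import subprocess

# Hypothetical reverse-DNS name for the digest xattr; not a standard.
BASE_NAME = "com.example.sha256"


def flagged_name(base: str, flags: str = "S") -> str:
    """Append xattr flags to a name, e.g. 'com.example.sha256#S'."""
    return f"{base}#{flags}"


def sha256_hex(path: str) -> str:
    """Return the SHA-256 digest of the file's data fork as hex text."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tag_file(path: str) -> str:
    """Store the current digest in the flagged xattr (macOS only)."""
    digest = sha256_hex(path)
    # 'xattr -w name value file' writes an xattr on macOS.
    subprocess.run(
        ["xattr", "-w", flagged_name(BASE_NAME), digest, path],
        check=True,
    )
    return digest


def verify_file(path: str) -> bool:
    """Compare the stored digest with a freshly computed one (macOS only)."""
    # 'xattr -p name file' prints the value of an xattr on macOS.
    stored = subprocess.run(
        ["xattr", "-p", flagged_name(BASE_NAME), path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout.strip()
    return stored == sha256_hex(path)
```

Calling tag_file on a document and later verify_file on a copy of it should return True so long as the copy’s data is intact and the xattr has been preserved.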

Of the three options for storing digests for files, xattrs are the only one which ensures that, in the great majority of instances, the digest moves with its data, even across to different Macs, regardless of the location of the tagged file.

Xattrs and backups

One behaviour peculiar to xattrs for the storage of digests is that adding, changing or removing a xattr attached to a file doesn’t change the file’s modification datestamp, although the action is still recorded in the local FSEvents database. In some situations this is advantageous; in others it isn’t helpful. These are best illustrated by looking at backups.

Most third-party backup systems for macOS rely on file modification dates to determine whether items need to be backed up. If a file is tagged with a xattr containing its current digest, that file’s modification date remains unchanged, and a fresh copy of the file (or even its xattrs) isn’t added to the backup.
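The kind of test involved can be sketched in a few lines of Python. This is only an assumption about how such a utility might decide, and needs_backup is a hypothetical function rather than anything taken from a real product.

```python
import os


def needs_backup(path: str, last_backup_time: float) -> bool:
    """Select a file only if its data was modified after the last backup.

    last_backup_time is in seconds since the epoch, like st_mtime.
    """
    return os.stat(path).st_mtime > last_backup_time
```

Because writing a xattr leaves the modification date as it was, a file tagged after the last backup fails this test and is skipped.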

One solution to this is for the user to tag items in a folder which is excluded from being backed up. Once they’re tagged, they can then be moved back to their normal locations, where they should be included in the next backup. This wouldn’t of course be feasible for the whole of a Home folder, but should prove straightforward for most important documents, whose integrity is essential.

This problem should also be reduced or eliminated by tagging documents with their digests immediately after editing their data so that they are backed up with those digests, rather than later in large batches.

Time Machine has the opposite problem. Because, in Catalina and earlier versions of macOS, it normally relies on the FSEvents database to determine which items are to be backed up afresh, and adding or changing xattrs is recorded as a change in FSEvents, every change made to stored digests in a xattr will result in the item being backed up in full. This can result in repeated full backups of files whose very large data forks haven’t changed, when all that has changed is a mere 32 bytes in a xattr.

The best answers in both cases will only come with wider use of digests being saved to xattrs, and their accommodation in backup utilities. Time Machine’s behaviour already causes problems in other situations, for example when an app adds a quarantine flag to an existing file, which triggers the whole of that file to be copied into the next backup.

This leads us to store each SHA256 digest in a custom xattr attached to that file.