File Integrity 2: Which digest?

In the first article in this series, I looked at why we need to be able to check the integrity of files, and what macOS doesn’t provide to do this. If we’re going to use software to meet this need, then we need to decide what that software should do, which is the subject of this article.

Although in principle there’s no reason that third-party software couldn’t implement error-correcting codes (ECC), in practice adding this on is too complex, at least for me to attempt. What we can do, without trying to piggyback anything onto APFS, is provide a robust means of checking the integrity of the data within files. To do that requires a software-generated ‘signature’ which is, for all practical purposes, unique to the contents of each file.

One traditional approach, which you’ll still see used in some places, is to calculate a checksum, which has the advantage of being very quick. The most popular of these is the Cyclic Redundancy Check, or CRC. Wikipedia has an excellent article which explains in detail exactly how these are computed. CRCs are suitable for some communications protocols, such as Ethernet, and are used in some file types, but aren’t robust enough for a general file integrity check.
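As a rough illustration of how quick and small such a checksum is, here is a minimal Python sketch that computes a CRC-32 over a file’s data. The function name crc32_of_file and the chunk size are my own choices for this example, and zlib’s CRC-32 is only one of many CRC variants.

```python
import zlib

def crc32_of_file(path, chunk_size=1024 * 1024):
    """Return the CRC-32 of a file's data as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as f:
        # Read in chunks so that large files don't have to fit in memory.
        while chunk := f.read(chunk_size):
            # Passing the running value continues the same CRC.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

Whatever the size of the file, the result is only 32 bits long, which is one reason it can’t serve as a general integrity check.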

For the risks which I listed in the first article, we really need a cryptographic hash function, which can be generated for the entire file, whether it’s a couple of KB or hundreds of GB. An important measure of the suitability of a hash function is how easy it is to generate a ‘collision’, that’s two different files which have the same hash value. For example, for the old (and now largely disused) hash function MD5, your Mac can find collisions in just a few seconds.

Although not as resistant to attack as some methods, the 256-bit version of SHA-2, commonly known as SHA-256 or SHA256, is now widely used for this type of purpose. It’s fairly resistant to collisions, reasonably quick to calculate even on large files, and generates a 32-byte number, known as the digest. A more robust and potentially quicker option is BLAKE3, which was only released in January 2020.
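To show what calculating a digest for an entire file can look like, here is a minimal sketch in Python using its standard hashlib module. It isn’t the code used in any particular app; the function name sha256_of_file and the 1 MiB chunk size are assumptions made for this example.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 digest of a file's data as a 64-character hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Feed the file through in chunks so memory use stays flat,
        # whether the file is a couple of KB or hundreds of GB.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

The 64 hexadecimal characters returned represent the 32 bytes of the digest, and the result is the same whatever chunk size is used.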

One significant advantage of SHA256 is that it’s widely used in macOS, offered in Apple’s Common Crypto library prior to 10.15, and now in Catalina’s CryptoKit. Those two versions are compatible, in that they will both generate the same digest from the same input file. CryptoKit provides an interface to the SHA256 hashing function which is more secure, and less likely to result in vulnerabilities. It should also perform well.

Cryptographic hash functions like SHA256 and BLAKE3 have another valuable property: they amplify even the smallest change in the data used to generate a digest. Changing a single x to a y in the middle of a 2 MB text file, for example, changes its digest from
0x437c23ba127389351e7e8687f71ced911875d1c133e0f1ee3501ae8b2c800b03 to
0xd2b49d470bcf215036f209ee127176ef55141ed6fb76ba82294c92d93e6bf278. This makes them particularly suitable for use in the detection of ‘bit rot’, which may result in such small changes over long periods of time.
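You can try this amplification for yourself with a short Python sketch. The buffer of two million x characters is a hypothetical stand-in for a text file, so its digests won’t match those quoted above, and the helper differing_bits is my own name for this example.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Hypothetical stand-in for a 2 MB text file: two million 'x' characters.
original = bytearray(b"x" * 2_000_000)
modified = bytearray(original)
modified[len(modified) // 2] = ord("y")  # change one x to a y in the middle

d1 = hashlib.sha256(original).digest()
d2 = hashlib.sha256(modified).digest()
print(d1.hex())
print(d2.hex())
print(differing_bits(d1, d2), "of 256 bits differ")
```

With a good cryptographic hash you’d expect around half of the 256 bits to differ, even though only one byte of input was changed.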

To give you an idea of the difference between different types of checksum or hash, an example file generates the following digest results:
CRC16 0x53D4
CRC32 0xD0358490
MD5 0xca9cf564fcc3a2950bdf70d9286d8bb0
SHA256 0x437c23ba127389351e7e8687f71ced911875d1c133e0f1ee3501ae8b2c800b03
You can look at this in detail using Hash and similar apps available from the App Store, which you can use to perform manual integrity checks. As you can see from the length of the digests, finding two different files with identical CRC16 values is going to be relatively easy. Finding two with identical SHA256 digests is almost impossible.
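To see why a 16-bit checksum is so easy to collide, consider that it has only 65,536 possible values, so any 65,537 different inputs must contain at least one pair which share a value. The following Python sketch searches for such a pair by brute force. It uses binascii.crc_hqx, which implements the CRC-CCITT variant; that may not be the same CRC16 variant as was used for the example above, and the input strings are invented for the purpose.

```python
import binascii

def find_crc16_collision():
    """Return two different byte strings with the same 16-bit CRC, and that CRC."""
    seen = {}  # maps each CRC value to the first input that produced it
    for i in range(65537):  # one more input than there are possible CRC16 values
        data = f"example file contents, version {i}".encode()
        crc = binascii.crc_hqx(data, 0)
        if crc in seen:
            return seen[crc], data, crc
        seen[crc] = data
    raise AssertionError("unreachable: 65537 inputs cannot have 65536 distinct CRCs")

a, b, crc = find_crc16_collision()
print(a, b, hex(crc))
```

The same exhaustive approach applied to SHA256 would need to cope with 2^256 possible digests, which is far beyond any conceivable computer.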

Another important question is whether the digest should be calculated for just the data stored in the file’s data fork, or whether it should include metadata, in particular the file’s extended attributes (xattrs). Although in general data stored in xattrs isn’t part of the content of the file and is often ephemeral in nature, it’s not simple to determine whether any given xattr merits inclusion in a digest. Some xattrs, such as the quarantine flag, can change frequently in any case. It’s therefore better to calculate the digest for just the data fork of each file.
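One consequence of hashing only the data is that changes to metadata leave the digest untouched. Here is a small Python sketch of that, which changes a temporary file’s modification time as a stand-in for metadata. It doesn’t alter xattrs, as I’m assuming only Python’s standard library, which has no xattr calls that work on macOS; the function names are my own.

```python
import hashlib
import os
import tempfile

def content_digest(path):
    """Return the SHA-256 hex digest of the file's data only."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def digest_survives_metadata_change():
    """Check that changing a file's timestamps doesn't change its digest."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"example file contents")
        before = content_digest(path)
        os.utime(path, (0, 0))  # set access and modification times to the epoch
        after = content_digest(path)
        return before == after
    finally:
        os.remove(path)

print(digest_survives_metadata_change())
```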

Calculating a digest for a file only fingerprints that file at an instant in time. One question that I haven’t yet resolved is whether each digest needs to be accompanied by the timestamp of its creation, much in the way that code signatures in macOS are. This is potentially useful in deciding what to do with a file which fails integrity checks: knowing exactly when its digest was created could help the user locate a backup copy which corresponded to the version which failed checking. However, the timestamp can’t tell the user when any corruption occurred, only a moment in time before that happened. Currently, I’m not convinced that a timestamp would serve any useful purpose.

So, if we’re going to use a file integrity checker, it should calculate the SHA256 digest on the data fork, which then needs to be associated somehow with that file, but probably not with a timestamp.
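Put together, the check itself could be as simple as the following Python sketch. How and where the stored digest is kept is left open, so here it’s just passed in as a hex string; the names sha256_hex and verify are assumptions for this example.

```python
import hashlib

def sha256_hex(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file's data."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, stored_hex):
    """Return True if the file's current digest matches the one stored earlier."""
    return sha256_hex(path) == stored_hex.lower()
```

A file would be tagged once by saving the result of sha256_hex, then checked later by calling verify with that saved value. A result of False means that the data has changed since it was tagged, although not when or how.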

In the next article, I will consider how to associate the digest and file.

4 Comments

1

xz4gb8 on April 6, 2020 at 12:49 pm

Timestamps are vital to forensics, aiding in establishing an evidentiary trail.

- 2
  
  hoakley on April 6, 2020 at 4:13 pm
  
  Yes, I agree. But would they serve a good purpose here, or just waste cycles and bytes of storage?
  Howard.
  
  - 3
    
    Pico on April 7, 2020 at 4:36 am
    
    One potential benefit to including a timestamp would be for Dintch to be able to decide whether a file’s hash should be recalculated when re-tagging. This of course could be a user optional feature as a way to speed up re-tagging for large selections. Assuming, of course, that it’s considerably faster to compare the file’s last modification date to the hash’s timestamp than to generate a new hash.
    
    Another potential benefit is that the hash timestamp would be an extra point of context when a hash has changed but the last modification date hasn’t. Dintch could more strongly suggest that the file may have been corrupted in this case, compared with one where the file’s modification date is newer than the hash timestamp.
    
    - 4
      
      hoakley on April 7, 2020 at 7:50 am
      
      Thank you, Pico.
      I have considered both of those.
      When retagging a file, the last modification date is of no practical value, as many of the ways in which the data could have become changed wouldn’t be reflected in any change in that datestamp. Otherwise, why bother to calculate digests at all, when you could just trust the datestamp? The whole purpose of generating and storing digests is that you can’t simply trust the datestamp, but have to check the integrity of the data.
      I’m also unsure of what the second really tells you. It doesn’t tell you when the change occurred, which is perhaps the most useful thing you could get from a datestamp, only an open-ended period over which the change occurred, which could be many months or years. It doesn’t tell you what changed the data, simply whether that change was performed through the file system or not. Does that help you make any decisions about what to do with that file? I don’t think so.
      One issue with datestamps is that they’re remarkably easy to forge. To secure them isn’t easy, and could involve hashing or encrypting them, which starts getting very messy. For forensic purposes, a datestamp which could have been altered has very limited value.
      Howard.
      
