hoakley July 10, 2021 Macs, Technology

Explainer: checksums, CRCs, hashes and cryptography

There are many occasions when we need a ‘fingerprint’ of a file or other data. Fingerprints can be used to check the integrity of a file, download or message, to verify the authenticity of something more precious, such as a passphrase, or even for tasks such as testing whether two chunks of text are the same. Depending on the method used and its behaviour, these are variously known as checksums, CRCs and hashes.

At their heart is a common task: to reduce a variable amount of data to a single fixed-length number in a way that the number is distinctive of the data. That number is the checksum or hash of the data.

A simple example might be to add together all the bytes in the data to make a single 32-bit integer, ignoring any carries in the addition. That 32-bit integer is then the checksum. Examples of what you could do with a checksum or hash include:

check whether two different chunks of data are identical, by comparing their checksums;
check that copies of the data are identical, by comparing their checksums;
check whether a file has been corrupted when downloaded, by comparing it against its known checksum;
index large collections of data using the checksum or hash instead of the data itself.
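
The byte-summing checksum described above can be sketched in a few lines of Python. The function name `additive_checksum` is invented for this illustration, and this is a toy to show the idea rather than a checksum anyone should rely on.

```python
def additive_checksum(data: bytes) -> int:
    """Add every byte together, discarding any carry beyond 32 bits."""
    total = 0
    for byte in data:
        total = (total + byte) & 0xFFFFFFFF  # keep only the low 32 bits
    return total


original = b"The quick brown fox"
copy = bytes(original)
damaged = b"The quick brown fix"

# Identical data always gives an identical checksum.
print(additive_checksum(original) == additive_checksum(copy))
# A single changed byte changes the sum here.
print(additive_checksum(original) == additive_checksum(damaged))
# Weakness: addition ignores order, so rearranged bytes collide.
print(additive_checksum(b"ab") == additive_checksum(b"ba"))
```

The last comparison shows why such a simple sum is a poor choice in practice: any reordering of the same bytes produces the same checksum.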

To do this, checksums and hashes must be quick to calculate, and checksums for different data should almost always be different. When two different chunks of data result in the same checksum or hash, that’s known as a collision, and is every bit as bad as that sounds.

Clearly, the longer the checksum, the lower the risk of collisions. A single byte could be very quick to calculate and economical on storage, but with only 256 different values, collisions would be too common for it to be of any practical use. Collisions are also less likely when all the possible values of the checksum or hash are approximately equally probable.
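
To put rough numbers on that, here is a small sketch using the standard ‘birthday’ approximation. It assumes that every checksum value is equally likely, which is an idealisation, and the function name `collision_probability` is made up for this example.

```python
import math


def collision_probability(items: int, bits: int) -> float:
    """Approximate chance that at least two of `items` values collide
    when each is drawn uniformly from 2**bits possible checksums."""
    space = 2.0 ** bits
    exponent = items * (items - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x), but stays accurate when x is tiny.
    return -math.expm1(-exponent)


for bits in (8, 32, 256):
    print(f"{bits:3d}-bit checksum, 100,000 items: "
          f"{collision_probability(100_000, bits):.3g}")
```

With an 8-bit checksum, a collision becomes more likely than not once there are only about 20 items, whereas a 256-bit hash makes the chance negligible even for very large collections.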

Basic non-cryptographic hashes are widely used throughout computing. For example, when comparing two text strings to see if they’re the same, it’s often far quicker to compare their hashes rather than step through comparing every character in the strings. In Swift, for instance, many data types are hashable, making hashes available for such purposes.
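
As an illustration of that idea, written here in Python rather than Swift, the sketch below compares hashes first and only falls back to a full comparison when they match. The helper name `probably_equal` is hypothetical; Python’s built-in `hash()` stands in for whatever hash a language makes available.

```python
def probably_equal(a: str, b: str) -> bool:
    """Compare hashes first; only equal hashes need a full comparison."""
    if hash(a) != hash(b):
        return False  # different hashes: the strings must differ
    return a == b     # same hash: confirm, in case of a collision


# Hash-based containers rely on the same trick to find items quickly.
seen = {"checksum", "CRC", "hash"}
print(probably_equal("checksum", "checksum"))
print(probably_equal("checksum", "checksun"))
print("CRC" in seen)
```

The final equality test matters: matching hashes only suggest that two strings are the same, because of the possibility of a collision, while differing hashes prove they are not.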

A common form of checksum is the Cyclic Redundancy Check (CRC). CRC-32 generates a 32-bit number which is used to check the integrity of a file or message which is transmitted. It has been incorporated into many standards, including Ethernet, SATA and various compression methods. Fletcher checksums are an alternative which can be faster to compute and perform similarly for their length.
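
Both are easy to try out. The sketch below uses `zlib.crc32` from Python’s standard library for CRC-32, and a hand-written `fletcher16` function (my own illustrative name) for the 16-bit variant of Fletcher’s checksum, the shortest of that family.

```python
import zlib


def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums, each taken modulo 255."""
    sum1 = 0  # simple sum of the bytes
    sum2 = 0  # sum of the running sums, which makes byte order matter
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


print(f"CRC-32 of '123456789': {zlib.crc32(b'123456789'):08x}")
print(f"Fletcher-16 of 'abcde': {fletcher16(b'abcde'):04x}")
# Unlike a plain sum of bytes, reordering the data changes the result.
print(fletcher16(b"ab") != fletcher16(b"ba"))
```

Neither is designed to resist deliberate tampering: they are intended to catch accidental errors in transmission or storage.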

Longer and more sophisticated hash functions are designed to reduce the chance of collisions, so that they become resistant to deliberate attacks, such as a crafted file having the same hash as an innocent but important one. Those which prove most resistant are usually known as cryptographic hashes, and are often incorporated into security systems. Important properties of cryptographic hash functions include:

The mapping from input data to hash is deterministic, so the same data always generates the same hash.
The hash is quickly computed using current hardware.
It’s not feasible to work out the input data for any given hash, making the mapping one-way.
Collisions are so rare as to not occur in practice.
Small changes in the input data should result in large changes in the hash, so amplifying any differences.
Hash values should be fairly evenly distributed.
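
Two of those properties, that the same data always gives the same hash and that small changes in the data produce large changes in the hash, are easy to observe. This is a minimal sketch using SHA-256 from Python’s `hashlib`; the helper names `sha256_hex` and `differing_bits` are invented for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


first = hashlib.sha256(b"checksum").digest()
second = hashlib.sha256(b"checksun").digest()  # one character changed

# Same input, same hash, every time.
print(sha256_hex(b"checksum") == sha256_hex(b"checksum"))
# A one-character change should flip roughly half of the 256 bits.
print(differing_bits(first, second), "of 256 bits differ")
```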

One famously failed hash is SHA-1, which uses 160-bit (20-byte) numbers often known as a message digest. In 2005, it was demonstrated that those with sufficient computing resources could break its security, and in 2017 two different PDF files with the same SHA-1 hash were published. A predecessor to SHA-1, MD5, is even less resistant to attack, and has largely been abandoned too.

Modern cryptographic hashes still trusted include the improved and longer successors to SHA-1 in the SHA-2 family, SHA-256, SHA-384 and SHA-512, and BLAKE3, which is currently one of the fastest.

macOS has built-in support for cryptographic hashes, and uses them extensively in many of its security features. Notable examples include code signatures, which include ‘cdhashes’ of the protected parts of each app, bundle, etc. These are relatively independent of signing certificates, and are the underlying reason for M1 Macs needing all native executable code to be signed. Cryptographic hashes are also used to verify the integrity of the Sealed System Volume in Big Sur, where they’re assembled into a hierarchy like a Merkle tree.
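
For those unfamiliar with Merkle trees, the general idea can be sketched as follows. This is a generic illustration and not the format used by macOS: the function name `merkle_root`, the choice of SHA-256, and the rule of duplicating the last hash when a level has an odd number of entries are all assumptions made for the example.

```python
import hashlib


def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash each block, then hash pairs of hashes until one remains."""
    if not blocks:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(block).digest() for block in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed rule: repeat the last hash
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


blocks = [b"block 0", b"block 1", b"block 2", b"block 3"]
root = merkle_root(blocks)
blocks[2] = b"block 2, altered"
# Changing any one block changes the single hash at the top of the tree.
print(root != merkle_root(blocks))
```

The attraction of the hierarchy is that one hash at the top vouches for everything beneath it, so verifying that single value is enough to detect a change anywhere in the data.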

More generally, cryptographic hashes are used in message authentication codes (MAC) to verify data integrity in TLS (formerly SSL), and they’re often used in the process of pseudonymisation to protect the identities of individuals who take part in research projects.
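
A message authentication code combines a secret key with the hash, so that only someone holding the key can produce or check the code. The sketch below uses HMAC with SHA-256 from Python’s standard `hmac` module; the function names and the example key are hypothetical, and TLS itself involves a great deal more than this.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 authentication code for the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a code using a constant-time comparison."""
    return hmac.compare_digest(make_tag(key, message), tag)


def pseudonym(key: bytes, identifier: str) -> str:
    """Replace an identifier with a keyed hash, one way to pseudonymise."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example key, not for real use"
tag = make_tag(key, b"transfer 10 pounds")
print(verify(key, b"transfer 10 pounds", tag))    # untouched message
print(verify(key, b"transfer 1000 pounds", tag))  # altered message
print(pseudonym(key, "participant-042"))
```

Using a key in the pseudonym means that someone without the key can’t simply hash a list of likely names and look for matches.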

My own suite of apps for verifying file integrity, Dintch, Fintch and cintch, use SHA-256 hashes provided by Common Crypto in macOS.
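
The general technique of checking a file against a stored SHA-256 hash can be sketched in Python as below. This isn’t how those apps are implemented, as they use Common Crypto; the function names `sha256_of_file` and `is_intact` and the chunk size are assumptions for the purpose of illustration.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks, so large files needn't fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected_hex: str) -> bool:
    """Compare a file's current hash against one saved earlier."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

In use, the hash would be computed and saved when the file is known to be good, then `is_intact` called later with that saved value to discover whether the file has changed since.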

20 Comments

1

Rocky on July 10, 2021 at 8:26 pm

I spent decades managing petabytes of irreplaceable data scattered across millions of files on too many short-lived spinning hard drives. I had lots of time to worry about data integrity, checksums, and hashes at what was then a very large scale.

Here are a few of the finer points that many people miss:

– 16-bit TCP checksums are not nearly big enough to guarantee large-file integrity. Use much bigger and better checksums before and after moving files across networks.

– Moving files across internal system buses isn’t guaranteed to be error free. People at CERN and other massive data locations learned that the hard way. Even fleeting trips through RAM can introduce errors.

– If you have a file and a checksum, and a recompute shows a mismatch, which one was corrupted – the file or the checksum? If you really care about file integrity, store at least two, somewhat unrelated, high-quality hashes. And check both of them regularly. Using networks, system buses, and RAM of course. Quite the rabbit hole.

– There are no magic bullets for data integrity. RAID is not the answer – I have the scars to prove it. Same for ZFS.

– But data throughput can be a killer when you need to restore on a massive scale, or recompute checksums.

– Librarians know more than you might think about long-term data integrity. LOCKSS – Lots Of Copies Keep Stuff Safe. For important data, keep copies offline. With hashes. As many as you can afford. In as many scattered locations as you can afford.

– The forward march of technology is a killer. Not that many years ago I left behind thousands of 9-track tapes and short-lived removable disks with irreplaceable data. Couldn’t read them, couldn’t bear to throw them out.

– Resign yourself to losing some data forever. Eventually your management or successor will not be obsessed with the integrity of “old” data. Somehow the world muddles through.

- 2
  
  hoakley on July 10, 2021 at 9:51 pm
  
  Thank you.
  There’s a simple answer to your conundrum of which is corrupt, the file or the checksum: probability. What’s the chance of several GB or more of data being corrupted somewhere, versus the chances of corruption in just 256 bits?
  And thankfully SSDs are a game-changer too.
  Howard.
  
  - 3
    
    Duncan on July 13, 2021 at 5:04 am
    
    It seems the easy solution is to simply store two copies of the checksum, which entails almost no overhead. Have a ‘Paranoid’ option that can be enabled for those so inclined (like me).
    
    - 4
      
      hoakley on July 13, 2021 at 10:52 pm
      
      Thank you. I disagree completely. If the checksum doesn’t match that of the file, both are deleted and replaced by a copy from a backup, in which the match is correct. No paranoia is involved – you don’t and can’t know what has caused the mismatch, so can only return to a copy. Adding another checksum, which could differ again, only complicates matters and tempts you to retain data which is no longer beyond suspicion.
      Howard.
      
  - 5
    
    Duncan on July 13, 2021 at 5:08 am
    
    To add: I see that’s what Rocky said as well. The trick is to make the implementation dead-simple, both for saving the extra copy but also for verifying that the two copies match. No extra steps should be required.
    
6

Nigel Barker on July 12, 2021 at 6:48 am

Excellent overview article. I knew all this stuff once before I gave up my career in IT over 10 years ago. Thanks for the reminder & update.

- 7
  
  hoakley on July 12, 2021 at 7:10 am
  
  Thank you.
  Howard.
  
8

coxorange on July 14, 2021 at 1:31 am

Hello,

My question is about checksums… / Time Machine backups to APFS-formatted hard drives.

Some years ago I read:
“APFS does not provide checksums for user data.”
https://en.wikipedia.org/wiki/Apple_File_System
More details here:
https://arstechnica.com/gadgets/2016/06/a-zfs-developers-analysis-of-the-good-and-bad-in-apples-new-apfs-file-system/

But I also read that in the future there shall be improvements. Has this been addressed in the meantime?

My situation:
I finally need to upgrade my iMac from Catalina to Big Sur. I’m using two external WD USB hard drives (alternately, once a day) with Time Machine to have two backups (“Mac OS Extended (Journaled)”). Yesterday one of the backup drives failed (out of warranty), and I want to replace both drives with new, larger ones.

I’m a bit worried because my iMac (Fusion Drive) is a bit old (I’m waiting for an M1 iMac 27″), and I only have one backup at the moment…!! So what’s the best, most secure way to upgrade macOS and to replace both drives?

I also read on that Wikipedia page:
“Big Sur’s implementation of Time Machine in conjunction with APFS-formatted drives enables “faster, more compact, and more reliable backups” than were possible with HFS+-formatted backup drives.”

Is it really more reliable to use APFS for external backup hard drives now?

If yes, will I first have to format them to HFS+, make backups and THEN upgrade to Big Sur?
Then upgrade the backup drives’ format to APFS?

Many thanks.

- 9
  
  hoakley on July 14, 2021 at 6:54 am
  
  Thank you.
  I have a whole series of articles here about Time Machine backups to APFS which you may find useful reading.
  No, for local backups to APFS, TM doesn’t checksum the data/files, and there’s no way to turn that on, apparently. Neither does APFS offer this ability itself. Checksums are used in the file system metadata, though.
  I’m preparing an article for publication tomorrow (Thursday) which looks at how best to upgrade to Big Sur which you should find helpful. This morning’s on Time Capsules should also be of interest.
  Yes, TM to APFS is faster, more economical on storage space, and far more reliable. This is explained in detail in that series of articles here.
  There’s no point in formatting your new backup storage to HFS+ immediately before upgrading to Big Sur, particularly if you want to back up to APFS. You’ll be better formatting them to APFS and making a clone of your startup volumes using a utility such as Carbon Copy Cloner or SuperDuper!.
  Finally, there’s little point in using TM to back up just once a day. It’s designed to run every hour, and that’s the best way to use it. If you want daily backups, then use CCC or SuperDuper!, or ChronoSync, which are better suited to that.
  Howard.
  
  - 10
    
    coxorange on August 11, 2021 at 5:11 pm
    
    Thanks for your answer.
    In the meantime I’ve read a lot of your articles about Time Machine and APFS, however I’m still not sure about the reliability and about what to do.
    
    Fact is that TM to APFS does not use checksums for user data/files.
    (I thought TM to HFS+ did it – can you confirm this?)
    
    If there are no checksums, there is no way to know (e.g. if one bit had flipped) whether data is still correct or destroyed. Right?
    
    So the only way to determine the integrity would be to compare the original data source (if still intact) with the backup.
    (I thought TM to HFS+ did check for integrity after copying via re-reading the files and comparing with the checksums – is this correct?)
    
    Another thought came to mind: If APFS does not use checksums for files on the internal SSD either (files restored from a backup to a new Mac or newly created files), then it would be unknown if these source files copied by TM were intact at all! (this could happen after some years of SSD usage). I hope I’m wrong, am I?
    
    Depending on the answers it might be “more reliable” to stick with TM to HFS+ as long as possible, until Apple improves this.
    
    (I would want to stick with TM and not use 3rd party apps like CCC.)
    
    Thank you!
    
    - 11
      
      hoakley on August 11, 2021 at 10:08 pm
      
      Thank you.
      Both HFS+ and APFS are claimed to be able to verify checksums on backups, although I think that there’s some doubt about those on APFS.
      APFS backups are very different though: as they’re snapshots, they can’t be modified at all, so whether checksums have any useful purpose is open to debate.
      There’s a huge difference in reliability between HFS+ backups with millions of hard links, and APFS backups in which each file system is but a single backup.
      I don’t think that Apple has any intention of even maintaining backups to HFS+: Big Sur can’t start new ones, and I wouldn’t be surprised if Monterey doesn’t support them at all. If you want reliable backups with a future, then APFS is the only way to go with TM.
      Howard.
      
12

coxorange on August 11, 2021 at 11:44 pm

Thank you.

> Both HFS+ and APFS are claimed to be able to verify checksums on backups

Oh, I thought APFS wouldn’t use checksums for user data/files. I’ve read that a lot of times. Has this changed lately?
In this case write operations on the internal HDD/SSD could (would?) be verified which would be a relief.

I’ve also read APFS isn’t (very) suitable for hard drives (my ext. TM drive) – has this changed too?

Unfortunately you couldn’t answer some other questions…

I wish I could simply trust APFS, but I still have some doubts.
On the other hand I never had problems with HFS+ and the speed and space-saving of APFS are not so important to me.
I just want the *most reliable* TM backups (still on ext. hard drives). IMO that’s most important.

- 13
  
  hoakley on August 12, 2021 at 5:58 am
  
  It’s entirely up to you to decide what you wish to use, so long as TM continues to support it, of course. All your questions are answered in the articles published here. You are of course welcome to ignore them, and to continue using a fundamentally unreliable file system like HFS+ for your backups. But Apple is already withdrawing support for that, in preference to APFS which is fundamentally much more reliable because of its use of copy-on-write and, in this case, immutable snapshots.
  Howard.
  
  - 14
    
    coxorange on August 12, 2021 at 5:59 pm
    
    I really appreciate your articles! It’s often just what I need, when Apple gives no or insufficient information!
    I don’t want to ignore any of your advice, I just want to be able to understand/comprehend.
    
    > All your questions are answered in the articles published here.
    
    I’ve read many of your articles, probably I’ve not found the relevant one regarding reliability (checksumming/verify/data integrity) with APFS/Time Machine?
    I would appreciate if you could point me to that one.
    I just want to find peace of mind when using APFS for external TM HDD backups.
    I hope you can understand.
    Thank you!
    
    - 15
      
      hoakley on August 12, 2021 at 10:41 pm
      
      Thank you.
      Before you even think about verification, the fundamental requirement for reliability is a file system which is designed to preserve integrity. There’s no point in trying to put water into a leaking bucket until you’ve fixed the leak.
      The bad news with backing up to HFS+ is that it’s an unreliable file system. It overwrites blocks when changing them, so in order to try to cope better with the inevitable errors that occur, it uses journalling, which doesn’t actually protect the data, just the file system. Take that, and in a single file system fill it with millions of hard links and it’s only a matter of time before there are file system errors and data loss.
      APFS doesn’t overwrite blocks, but uses copy-on-write. That ensures that the chances of any error occurring during write is exceedingly (if not vanishingly) small. TM to APFS doesn’t use hard links either, but creates snapshots, each of which has its own file system, which never changes. Maintenance vanishes, and chances of any data corruption are also effectively zero.
      The situation with TM verification isn’t clear; it appears to work on both HFS+ and APFS, but without deliberately corrupting your own backups, you’ll never confirm it. In any case, in APFS the snapshots are read-only, so you couldn’t test them anyway. It’s not particularly good verification either: it keeps checksums of the backed up files which can be compared against the original, so long as that hasn’t been changed. However, neither HFS+ nor APFS has any facility for checking file integrity.
      Howard.
      
16

coxorange on August 13, 2021 at 2:35 am

Thanks a lot for your detailed answer. I will go with APFS then!
I think the leaking bucket convinced me.
Last bit of a question, if it’s seen on an “atomic” level:
If the source file is just “01” and it is backed up by TM, but there is a hardware defect on the drive, so that it arrives there as “00” – would such an error be detected?
Thanks again.

- 17
  
  coxorange on August 13, 2021 at 12:03 pm
  
  addendum:
  Maybe such an error would be detected by the drive itself and then reported to TM?
  If yes, would a drive need to have certain features for that?
  I intend to buy WD Elements Desktop (WDBWLG0060HBK) or WD My Book (WDBBGB0060HBK-EESN).
  Both are about the same price but I couldn’t find out if they use the same drive.
  Do you know which one is better?
  (Elements has a useful status LED and My Book comes with software I won’t need anyway.)
  
  - 18
    
    hoakley on August 13, 2021 at 6:31 pm
    
    Thank you.
    If errors are detected during writes, they’re reported back to the app. However, you’d be surprised at how few apps handle such errors correctly: many simply write an entry in the log and carry on!
    I’m sorry, I don’t use hard disks any more. I always kept away from WD hard disks after bad experiences in the past, but they may have changed. My recent preference was for derivatives of the DeskStar range, which oddly are now made by WD, but are a different design.
    Howard.
    
    - 19
      
      coxorange on August 14, 2021 at 1:19 am
      
      Thanks. I always kept away from Seagate HDDs after bad experiences. :)
      
- 20
  
  hoakley on August 13, 2021 at 5:48 pm
  
  Thank you.
  Some disks do check data written in that way, but I’m not aware of any mechanism in HFS+ or APFS which does that. However, the chances of that happening are extremely (vanishingly) small when this is local storage. They are greater over a network, but network file system protocols incorporate schemes to prevent that from happening.
  Howard.
  