Explainer: deduplication

Selling ‘housekeeping’ utilities for macOS is a lucrative trade, and one feature which we all succumb to is the promise of deduplication: finding multiple copies of files, so we can delete unnecessary ‘clutter’ and free up storage space. Back in the days of HFS+ I can recall housekeeping apps which freed up more than ten percent of a hard disk in the space of a few minutes. This article explains why deduplication may now achieve little real saving of space, and why the savings it might claim are almost certainly gross overestimates. In reality, on a modern Mac, duplicates often come free, and don’t even clutter your backups.

Most of us still think of file systems in simple terms, like HFS+. Duplicate or copy a file, and macOS copies all the data to a new storage area and makes that a different file. Thankfully, APFS is much smarter than that, and regularly – indeed, as a rule whenever it can – doesn’t copy any data at all. What it does is create a clone file, which is a bit like a hard link, in that the file record points to the same data as the original. Unlike a hard link, the clone is a separate file, with its own inode.

The conditions which have to be met for macOS to create a clone are simple:

  • both the original and copy files must be on the same APFS volume, so sharing the same file system;
  • copying must be performed using either of the two forms of copyItem() in FileManager.

In practice, these conditions are met by all copies and duplicates made within the same volume by the Finder, and by most of those made by apps. This can also apply to whole folders, provided that they’re copied according to these rules.
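The first of those conditions is easy to test before copying. Here’s a minimal sketch in Python, assuming that two paths reporting the same device ID from stat() are on the same volume; the function name on_same_volume is mine, purely for illustration. It doesn’t check that the volume is APFS, nor that the copy will be made using copyItem(), so it can only tell you when a clone is impossible, not when one is guaranteed.

```python
import os


def on_same_volume(source: str, destination_dir: str) -> bool:
    """Return True if both paths report the same device ID.

    Assumption: matching st_dev values mean the same mounted volume.
    This says nothing about whether that volume is APFS.
    """
    return os.stat(source).st_dev == os.stat(destination_dir).st_dev
```

Pass it the file you intend to copy and the folder you want to copy it into. If it returns False, the copy has to be a real one, with all its data written out afresh.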

It’s easy to demonstrate this. Find a large file, at least 10 GB in size. Select it in the Finder and use the Duplicate command in the Finder’s contextual menu (Control-click, etc.). That file, which would take several seconds at least to really copy, is duplicated instantly. No data has been copied or duplicated at all.

Where this gets a little confusing is that the Finder doesn’t tell you that the duplicate takes no extra space. Put three duplicates in a folder, and the Finder assures you that they take three times the space of one of them, but that simply isn’t true. What’s more, and this is perhaps the most important point, when Time Machine in Big Sur backs them up to an APFS backup store, it doesn’t copy three files, just the one and two clones.

So, when you’re working on APFS volumes, within each volume, copies and duplicates come free, without any need for disk space, and don’t steal space in your backups either (macOS 11, Time Machine to APFS only). If you want to keep one copy of an important document in your app work folder, and another in a project folder, then feel free to do so, as the copy should be a clone and won’t waste storage space, whatever the housekeeping app might claim.

Why, then, don’t housekeeping apps spot this, and at least inform you that there’s no point in deleting those cloned duplicates? The answer is a shortcoming in macOS: APFS and Time Machine may know that a file is a clone, but everything else, including third-party apps, is left in the dark, and can’t tell whether any given file is a clone or not.
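To see how that leads to inflated claims, here’s a minimal sketch in Python of how a duplicate finder might work. It assumes that files are matched by size and SHA-256 digest, which is my assumption for illustration and not a description of any particular product; the function names are hypothetical too. Because it sees only file contents and sizes, a clone and a real copy look identical to it, and each is counted at full size.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group regular files under root by (size, digest); keep groups of 2+."""
    groups = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), file_digest(path))
                groups[key].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}


def claimed_saving(duplicates):
    """Bytes a naive tool would claim: every copy but one, at full size."""
    return sum(size * (len(paths) - 1) for (size, _digest), paths in duplicates.items())
```

Run against a folder containing a 10 GB file and two Finder duplicates of it, claimed_saving() would report 20 GB to be recovered, although deleting those two clones would free almost nothing.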

File cloning has been saving space in APFS since it was introduced in High Sierra, but it’s only in Big Sur that macOS provides any means of telling whether a file might be a clone. Even there, the information available is minimal: it simply lets an app know that a particular file has at some time been cloned or made as a clone. If one file of a clone pair is changed, then it starts requiring real storage space to hold the changed data, and it’s currently not possible for a third-party app to know that, or to know how much space is taken by the changed data.

So, should you take time and reorganise your files by deduplicating? Only if you really want to clean up copies. Don’t expect it to free significant storage space, either on that volume or in your backups, as many of those deleted duplicates may not have really taken any additional disk space at all. APFS is very clever, but it can also be profoundly confusing.

I have two free utilities which can indicate when a file might be a clone: Sparsity and Precize. Have fun with them.