Time Machine: 1 How it works, or fails to

This is the first in a series of articles in which I will try to explain much of what I know about Time Machine (TM), starting from its basic principles, how it is implemented in macOS from Sierra (and earlier) to Catalina, and how to troubleshoot and fix its problems. I didn’t intend to write a series, just a single article about its issues in Catalina. But without understanding what is going on when backing up, that didn’t make much sense. This first article explains the principles involved, how they’ve changed over different versions of macOS, and the tools you need for diagnosis.

Making backups is, in principle, a simple task. The first backup just consists of a copy of everything that the user wants backed up. Each backup after that is then a copy of everything that has changed since the last backup was made. When you want to restore any item(s) from your backup, you can select which version of them to retrieve, going right back to the very first.

TM is only one choice among a range of backup systems for macOS. It’s distinctive and popular because:

  • it’s free and bundled with macOS;
  • it’s well integrated with macOS;
  • it uses macOS to create a Finder illusion, with which users are familiar;
  • by default, it makes small backups each hour, which fit in with most usage patterns better than making large backups each night;
  • it generally works well, and can back up to a wide range of media, which needn’t be costly.

Other strategies are adopted by competitors, including most notably Mike Bombich’s Carbon Copy Cloner and David Nanian’s SuperDuper!

Origins

For many years, TM relied on two features which were distinctive of macOS:

  • the FSEvents database, which records changes made to the files and folders on each volume;
  • hard links to both files and folders, which are used to minimise the size of TM backups.

Hard links are an incredibly efficient way of making each backup look as if it’s a complete copy of the original, when in fact all those files and folders which have remained unchanged since the last backup are represented by hard links back to the previous version. Hard links to files are a common feature of file systems, but hard links to directories (folders) are not normally supported. Apple added them to HFS+ for this purpose.
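To make the idea concrete, here is a minimal Python sketch of a hard link to a file. It isn't how TM itself is written, and the file names are made up for illustration. It only shows that a hard link is a second directory entry for the same file, so it occupies no additional space for the data.

```python
import os
import tempfile

def hard_link_demo():
    """Create a file and a hard link to it; return (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")   # hypothetical name
        linked = os.path.join(tmp, "linked.txt")       # hypothetical name
        with open(original, "w") as f:
            f.write("unchanged since the last backup\n")
        # Add a second directory entry pointing at the same file data.
        os.link(original, linked)
        a, b = os.stat(original), os.stat(linked)
        return a.st_ino == b.st_ino, a.st_nlink

same_inode, link_count = hard_link_demo()
print(same_inode, link_count)
```

Both names report the same inode, and the file's link count rises to two. Hard links to directories can't be made this way: that is the part which Apple added specially to HFS+.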

Without them, this scheme wouldn’t work: unchanged folders would have to be created in the backup as real folders containing many more hard links to the files within. The number of hard links in each backup would quickly become huge, as every file in that backup would have to be represented either by a hard link or a new copy. If your internal storage contained one million files, of which only ten needed to be backed up, TM would have to create 999,990 hard links for that one backup alone.

At its earliest and simplest, TM’s backup service, backupd, was run every hour as a scheduled task. It looked at the FSEvents database on each volume it had to back up, discovered what needed to be copied into the new backup, copied those items across to the backup, and created the hard links required to make that look like a complete duplicate of the original.
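The copy-and-link step can be sketched in Python as follows. This is an illustration under stated assumptions, not backupd's code: the function name, its parameters and the demo files are all hypothetical, the list of changed items is passed in rather than read from FSEvents, and it uses hard links to files only, because directory hard links aren't available from Python. That makes it a model of the expensive scheme which directory hard links avoid.

```python
import os
import shutil
import tempfile

def incremental_backup(source, previous, new, changed):
    """Build `new` so that it looks like a full copy of `source`.

    `changed` is a set of paths relative to `source` that must be copied;
    every other file is hard-linked to its counterpart in `previous`.
    Returns (files_copied, hard_links_made).
    """
    copied = linked = 0
    for folder, _subfolders, files in os.walk(source):
        rel_folder = os.path.relpath(folder, source)
        # Folders are recreated as real folders (no directory hard links here).
        os.makedirs(os.path.normpath(os.path.join(new, rel_folder)), exist_ok=True)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_folder, name))
            old = os.path.join(previous, rel)
            dest = os.path.join(new, rel)
            if rel in changed or not os.path.exists(old):
                shutil.copy2(os.path.join(source, rel), dest)  # new or changed
                copied += 1
            else:
                os.link(old, dest)  # unchanged: link back to the previous backup
                linked += 1
    return copied, linked

def demo():
    """Make a first backup of three files, change one, then back up again."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source")
        os.makedirs(os.path.join(source, "docs"))
        for rel in ("a.txt", "b.txt", os.path.join("docs", "c.txt")):
            with open(os.path.join(source, rel), "w") as f:
                f.write("version 1\n")
        nothing = os.path.join(tmp, "no-previous-backup")
        os.makedirs(nothing)
        backup1 = os.path.join(tmp, "backup1")
        backup2 = os.path.join(tmp, "backup2")
        first = incremental_backup(source, nothing, backup1, set())
        with open(os.path.join(source, "a.txt"), "w") as f:
            f.write("version 2\n")
        second = incremental_backup(source, backup1, backup2, {"a.txt"})
        return first, second

first, second = demo()
print(first, second)
```

In the demo, the first backup copies all three files, and the second copies the one changed file and links the other two. Scale that up to a million files with ten changes, and the second backup would need 999,990 links, which is why linking whole unchanged folders matters so much.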

[Diagram: Time Machine backup sequence in Mac OS X 10.5]

The final maintenance phase thins existing backups, so that hourly backups cover only the most recent period. For earlier periods, TM reduces the number of backups retained to minimise demands on their storage. However, the total size of backups inevitably rises until eventually even the largest backup storage is exhausted.
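Apple describes the retention schedule as hourly backups for the past 24 hours, daily backups for the past month, and weekly backups for everything older. A thinning pass along those lines might look like the sketch below. The function name, the 30-day "month", the use of ISO weeks and the choice to keep the newest backup in each period are my assumptions for illustration, not details taken from TM.

```python
from datetime import datetime, timedelta

def thin_backups(backup_times, now):
    """Return the backups to keep, newest first.

    Assumed policy: every backup from the last 24 hours, then one per
    calendar day up to 30 days old, then one per ISO week beyond that.
    """
    keep = []
    seen_periods = set()
    for when in sorted(backup_times, reverse=True):
        age = now - when
        if age <= timedelta(hours=24):
            period = ("hour", when)               # keep every recent backup
        elif age <= timedelta(days=30):
            period = ("day", when.date())         # one per calendar day
        else:
            iso = when.isocalendar()
            period = ("week", iso[0], iso[1])     # one per ISO week
        if period not in seen_periods:
            seen_periods.add(period)
            keep.append(when)                     # newest backup in that period
    return keep

# Example: three days of hourly backups ending at midday on 15 June 2020.
now = datetime(2020, 6, 15, 12, 0)
times = [now - timedelta(hours=h) for h in range(72)]
kept = thin_backups(times, now)
print(len(times), "backups thinned to", len(kept))
```

Thinning slows the growth of the backup set but cannot stop it, as the oldest weekly backups are kept until space runs out.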

Diagnosing problems in those early backups is straightforward, thanks to TM making relatively infrequent log entries marking each step of the way.

Sierra

By the release of macOS Sierra, Apple had modified this simple behaviour in two respects. First, to ensure that backups remained within the scope of Spotlight search, once this sequence is complete, Spotlight’s mdworker processes index the latest backup into the main Spotlight indexes. The other change was to increase flexibility in the timing of backups. There is no need for them to occur precisely every hour, so Apple added TM backups to a complex background despatching system which aims to start them when the system is ready for them to run. This despatching system is run by two subsystems, DAS (Duet Activity Scheduler) and CTS (Centralized Task Scheduling), and aims to run each hourly backup within a period of 5-10 minutes around the intended hourly interval.

[Diagram: Time Machine backup sequence in macOS 10.12 Sierra]

Unfortunately, there is a bug in Sierra which Apple never fixed, leading to failure of the DAS-CTS scheduling system after running continuously for about 5 days or more. To this day, if you run TM in Sierra and don’t shut down or restart your Mac, automatic scheduling of TM backups will eventually stop happening regularly.

High Sierra and APFS

High Sierra fixed this scheduling bug, and brought a completely new file system, APFS, which supports file system snapshots. A snapshot is a complete copy of the file system metadata at an instant in time. Because APFS retains the previous versions of files rather than overwriting them immediately, a snapshot can be used to restore a volume to that previous state extremely quickly. Snapshots replaced the older Mobile Time Machine, introduced in OS X Lion in 2011, which was intended to provide limited backup facilities when proper backups couldn’t be made to an independent volume.

Apple saw the opportunity to replace the FSEvents database as the means of working out what to back up. In High Sierra and Mojave, snapshots are made and TM analyses those to determine what has changed and thus needs backing up. This turns out to be a bit more complex than it sounds, requiring a second snapshot to be made once backing up is complete, and is in any case only available when backing up APFS volumes.
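The principle of a snapshot diff can be shown with a toy model in Python. Here each snapshot is just a dictionary mapping a path to a metadata tuple of size and modification time. Real APFS snapshots are nothing like dictionaries, and the function name and sample paths are hypothetical, so treat this purely as a sketch of the comparison.

```python
def snapshot_diff(previous, latest):
    """Compare two toy snapshots, each a dict of path -> (size, mtime).

    Returns the paths added, modified and removed between them.
    """
    added = sorted(p for p in latest if p not in previous)
    removed = sorted(p for p in previous if p not in latest)
    modified = sorted(
        p for p in latest if p in previous and latest[p] != previous[p]
    )
    return {"added": added, "modified": modified, "removed": removed}

# Hypothetical snapshots taken at the previous and the current backup.
previous = {
    "/Users/me/notes.txt": (120, 1000),
    "/Users/me/report.doc": (9000, 1000),
    "/Users/me/old.tmp": (10, 900),
}
latest = {
    "/Users/me/notes.txt": (120, 1000),    # untouched
    "/Users/me/report.doc": (9500, 2000),  # edited since the last backup
    "/Users/me/photo.jpg": (40000, 2100),  # new
}
changes = snapshot_diff(previous, latest)
print(changes)
```

Only the added and modified items need to be copied; the untouched file can be represented in the backup without copying it again.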

[Diagram: Time Machine backup sequence in macOS 10.14 Mojave]

Catalina

TM had to change in Catalina anyway, because of the use of a Volume Group for what had previously been the single startup volume. For users wanting to back up the whole of the Volume Group, the default in 10.15, this brought a change to the structure of backups. Apple has also taken the opportunity to back up the Recovery volume, which previously wasn’t an option in TM. The effect for many users who have only been backing up a single startup volume is that Catalina now backs up three: the read-write Data volume, read-only System volume, and the normally unmounted Recovery volume.

Backing up the System and Recovery volumes brings further challenges, as neither was suited to the new technique of determining what needed to be backed up as the difference between APFS snapshots, or snapshot diff. It seems that technique wasn’t performing particularly well with plain APFS volumes such as the Data volume anyway, and Apple decided to return to using the FSEvents database.

This gives TM in Catalina a choice of four different techniques to determine what to back up from each of two or more volumes:

  • first backup, in which the entire contents of the volume are backed up, as happens when a volume is first backed up;
  • deep scan, in which the entire folder hierarchy is searched exhaustively, and all items which have been modified since the last backup are included in the list to be backed up (default for the Recovery volume);
  • FSEvents, in which TM analyses the FSEvents database and backs up all items which are recorded there as having been modified since the last backup (default for other volumes);
  • snapshot diff, in which TM compares the latest with the previous snapshot, and calculates what has changed between them (not available for HFS+ volumes, and probably only used as a fallback for APFS volumes when FSEvents is deemed unreliable or missing).
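The choice between those four techniques could be expressed as a simple decision function. The sketch below is my reading of the list above, not Apple's logic: the type names, fields and the order of the tests are hypothetical, and the final fallback to a deep scan when neither FSEvents nor snapshots can be used is an assumption.

```python
from dataclasses import dataclass
from enum import Enum

class Strategy(Enum):
    FIRST_BACKUP = "first backup"
    DEEP_SCAN = "deep scan"
    FSEVENTS = "FSEvents"
    SNAPSHOT_DIFF = "snapshot diff"

@dataclass
class Volume:
    name: str
    file_system: str              # "APFS" or "HFS+"
    backed_up_before: bool
    is_recovery: bool = False
    fsevents_usable: bool = True  # False if the database is missing or unreliable

def choose_strategy(vol):
    """Pick a technique for one volume, following the list above."""
    if not vol.backed_up_before:
        return Strategy.FIRST_BACKUP      # nothing to compare against yet
    if vol.is_recovery:
        return Strategy.DEEP_SCAN         # default for the Recovery volume
    if vol.fsevents_usable:
        return Strategy.FSEVENTS          # default for other volumes
    if vol.file_system == "APFS":
        return Strategy.SNAPSHOT_DIFF     # fallback, not available on HFS+
    return Strategy.DEEP_SCAN             # assumed last resort

data = Volume("Data", "APFS", backed_up_before=True)
recovery = Volume("Recovery", "APFS", backed_up_before=True, is_recovery=True)
damaged = Volume("Data", "APFS", backed_up_before=True, fsevents_usable=False)
print(choose_strategy(data), choose_strategy(recovery), choose_strategy(damaged))
```

Because the choice is made for each volume, a single backup in Catalina can use different techniques for different volumes.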

[Diagram: Time Machine backup sequence in macOS 10.15 Catalina]

What was originally three fairly simple steps in 10.5 has now become five steps, with the second particularly complex, choosing between quite different techniques for determining what to back up.

Troubleshooting

TM problems broadly fall into three categories:

  • those which prevent the first backup from completing,
  • those arising during regular hourly backups,
  • those preventing a backup from being used, typically to restore items.

These are made more complex by the fact that TM can now back up to multiple destinations, including both local and networked storage (NAS). Network storage doesn’t normally offer the HFS+ volume that TM expects on local storage, so there TM uses a different strategy, copying items over SMB to a sparsebundle hosted on the network storage.

Taking the simplest cases first, there are two main classes of problem to be tackled:

  • diagnosing an incomplete backup (normally the first) to a local HFS+ volume,
  • discovering what problems have been occurring during periodic backups made recently.

Tools

TM has always been a considerate and informative user of the log. The best way to tackle backups which are still in progress is to examine TM’s entries for that backup in the log. In El Capitan and earlier, that is easy to accomplish using Console, but from Sierra to Catalina the new unified log presents more of a challenge. Console in 10.12 and later is essentially unable to examine recent log entries, and can only stream fresh entries as they occur, unless you fancy making a logarchive and trying to browse that.
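If you're happy in Terminal or a script, recent entries can also be read from the unified log with the log show command. The Python sketch below builds such a command and, on a Mac, can run it. The function names are hypothetical, and the predicate assumes that TM writes its entries under the subsystem com.apple.TimeMachine, which is worth confirming on your own system.

```python
import subprocess

def tm_log_command(hours=1):
    """Build a `log show` command for TM entries from the last `hours` hours."""
    # Assumption: TM's log entries carry this subsystem.
    predicate = 'subsystem == "com.apple.TimeMachine"'
    return [
        "log", "show",
        "--predicate", predicate,
        "--info",                  # include info-level entries
        "--last", f"{hours}h",     # look back this many hours
        "--style", "syslog",
    ]

def read_tm_log(hours=1):
    """Run the command and return its output as text (macOS only)."""
    result = subprocess.run(
        tm_log_command(hours), capture_output=True, text=True, check=True
    )
    return result.stdout

cmd = tm_log_command(2)
print(" ".join(cmd))
```

Calling read_tm_log(2) on a Mac returns the matching entries from the last two hours. Longer periods can take a long time and produce a great deal of text.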

The best GUI tool for examining recent log entries in Sierra and later is my free Consolation 3, which defaults to settings which make this task very simple, however daunting its main window may appear.

[Screenshot: Consolation 3 showing Time Machine log entries]

Simply leave the Filter with the Time Machine radio button selected, select the syslog Style (although custom styles can be much better when installed), use the Period stepper to change the period to 1, 2, or whatever, select hour in the next popup menu, and click on Get log.

Although you can plough through hourly batches of TM log entries looking for problems, once TM is making reasonably regular backups and you want to check them, use my free T2M2 utility. That analyses logs over the chosen period and reports any problems found, together with basic statistics. The latter now includes a breakdown of TM’s use of different strategies for determining what to back up. Although you can run T2M2 while a backup is still being made, it isn’t designed for that, and results then need careful interpretation as they contain an incomplete backup.

[Screenshot: T2M2 report]

In the next article, I will look in more detail at the logs of backups and explain what can go wrong in them.

This series is dedicated to James Pond (1943-2013), who really did know everything about Time Machine.