Has Apple fixed the bug which made Time Machine backups irregular?

Within a fortnight of the release of macOS 10.12 Sierra, I and some other Mac users noticed something very odd with Time Machine backups: after a Mac had been running continuously for a few days, backups which had previously been regular, every hour, suddenly became irregular. At times, they’d skip several hours, and at others they’d be called so soon after the last backup that Time Machine refused to play.

At first I thought that it might be a new, undocumented feature, but after a while it became clear that this was a bug. I presumed at that time – when accessing Sierra’s new log was still very difficult – that this was a problem with launchd, the service which launches scheduled tasks.

The circumstances required to reproduce the bug are quite unusual, which is why it has not troubled many users. It requires that the Mac is left running continuously for more than a week or so. As the great majority of users shut their Macs down daily or weekly, they would be extremely unlikely to encounter the problem.

I have never been sure whether letting the system go to sleep each night prevents it from occurring: my iMac17,1 is unable to enter system sleep without sleeping the hard disk in its Fusion Drive, something that I am not prepared to let happen, so my Mac stays awake 24 hours a day, and is only restarted when I really have to.

In such circumstances, I observed the bug occurring, and Time Machine backups becoming erratic, typically after about ten to fourteen days of continuous running, although it was once as short as five, and has been as long as eighteen or nineteen. Once it has occurred, the solution is to restart, which seems to reset the stochastic timer on the bug.

To investigate this further needed two tools which I did not have last year: Blowhole, a command tool which I could use to write a simple message to the log every hour, and Consolation, a decent log browser which helped me analyse log entries which had already been made. Trying to use Console was an utter waste of time and effort. (Blowhole and Consolation are both available free from the Downloads item above.)

It was only when I discovered that Time Machine backups are not triggered by launchd, but by a far more complex mechanism involving Grand Central Dispatch, that I was able to trace the bug further. Using Consolation, I followed the sequence of calls which were initiating backups when they were running hourly, and when they had become irregular.

Grand Central Dispatch (GCD) is a very peculiar and essentially undocumented part of macOS which has a finger in many different pies. It schedules and runs periodic system tasks, like Time Machine backups, but does so to minimise power consumption, and to lessen their impact on use. If timed backups were still run using the old launchd mechanism, they’d occur pretty well on the dot, every hour. In fact they are less regular, wandering around within five or ten minutes of scheduled time.

This is because a part of GCD, apparently known as the Duet Activity Scheduler (DAS), maintains a weighted priority list of tasks which could be run. Periodically it re-evaluates the priority of items in that list by calculating their score, on a scale of 0.0 to 1.0. There are typically around seventy or more tasks in the list kept by DAS. When a task reaches a sufficiently high score, normally at least 0.8, DAS tells the Centralised Task Scheduling Service (CTS) to run that task. In the case of timed backups, this then fires up the backupd-auto daemon, which gets the backup going.

What was happening when my backups became irregular was that DAS was in trouble, writing to the log that ‘conditions are deteriorating’, and failing to tell CTS to run the task, so the backup was skipped. This implied that other routine tasks would be affected, and that GCD itself had broken down. The other symptoms – the time period required for this to happen, and its irregularity – suggested that this was the result of a cumulative bug such as a memory leak.

I reported the bug to Apple, gave them links to my reports here, and sent them a lot of additional material. It looks as if this bug has been fixed in the 10.12.4 update.

gcdbugfix

This iMac has now been running continuously, without system sleep, for over twenty-three days. Time Machine backups are still being made hourly, and have shown no signs of becoming irregular. This is the longest that this iMac has ever gone between restarts in the eighteen months that I have had it, as it suffered kernel panics every few days when it was running El Capitan.

I am optimistic that Apple’s engineers may have fixed this, and would like to thank them here in the hope that they have. For those of us who need to run our Macs for long periods without restarting, this is a big step forward.