Routine, background activities shouldn’t be run strictly by the clock. Traditional scheduling systems like Unix
cron, and the Mac’s LaunchAgents, are fine for brief tasks which don’t put much load on the cores or hog hardware resources. But when the clock says it’s 23 minutes past the hour and that is time to do some heavy lifting, it’s only going to upset the user.
macOS now strives to do better. Most system activities which can be run in the background are now managed by a sophisticated scheduling service which avoids running a Time Machine backup when your laptop is running on battery power, or flooding cores with heavy tasks when you’re trying to get on with your work.
The snag is that, throughout Sierra, there have been problems in that scheduling service, which eventually lead to its collapse. Leave your Mac running long enough, for about a week or more, and those background tasks no longer get dispatched properly. Routine backups become irregular, or stop altogether for long periods, and many of the seventy or eighty other background activities are also forgotten.
A couple of days ago I was able to see this happening in front of my eyes – or, rather, in front of my tools, as I watched what was going wrong using T2M2 and DispatchView. What I saw suggested that Apple may have tried to address this long-standing bug, but it isn’t been fixed yet.
During the late morning, my hourly Time Machine backups seemed to skip a beat. I checked them using T2M2 just before 1300, when five backups had been run over the last six hours. There was clearly no cause for alarm, the traffic light was still green, but I decided to keep an eye on things.
When I ran T2M2 next, at nearly 1430, the light was red, and its report read:
Analysis for 2 hours from 2017-05-28 11:23:08 +0000 to 2017-05-28 13:23:08 +0000:
Backing up to /dev/disk3s2: /Volumes/PROMISE PEGASUS/Backups.backupdb
on which there are 2.9 TB available.
Started 1 auto backups, and 0 manual backups; completed 1 backups successfully,
last backup completed successfully 111.0 minutes ago,
backed up 642 files in the last completed backup,
total data for each backup was 175.1 MB.
Times taken for each auto backup were 7.0 minutes,
Created 1 new backups, and deleted 1 old backups,
No error messages found.
Running check on scheduling system (CTS and DAS).
DAS appears to have stopped rescoring activities over the last 90 minutes.
Running CTS check next.
CTS has not been told to run a backup in the last 90 minutes. Scheduling may have stopped.
Consider restarting your Mac.
Before restarting, I explored a little further using DispatchView, which looks at log entries for Duet Activity Scheduler (DAS) and Centralized Task Scheduling (CTS), two key parts of the macOS scheduling and dispatch system.
Normally, DAS maintains a scored list of all its scheduled activities. Every few minutes, or more often, it reassess their scores, and decides whether any need to be run. Those that reach a high enough score, approaching 1.0, it then tells DAS to run. By 1117, that orderly system had ceased working.
Instead of rescoring the activities in its queue in the normal way, DAS reported that it was
Evaluating 77 activities based on triggers
which is its fallback when the normal scoring system has broken down. This works far less evenly, and DAS was deciding to run a lot of activities at a time, then going silent for long periods. At times its messages indicated its distress, referring to actions performed
"because conditions are deteriorating". Some activities which it had started were also going awry, and it was reporting that those were ‘CANCELED’.
As a result, CTS was being inundated with DAS instructions to run tasks, then left with nothing to do for a long time. Suddenly at 1127, for example, CTS ran the following tasks in a whole bunch:
Those give you a good idea of the now pervasive use of these scheduling services by a wide range of background services.
So as of 1127 or so, DAS and CTS were still running routine Time Machine backups, but rather erratically.
An hour later, DAS and CTS were on their knees, and failing to cope. From 1232, DAS went silent altogether for fifteen minutes. When it did start writing to the log again, it was erratic at best. Finally at 1252, it again noted that
"conditions are deteriorating", then tried to re-assess the activities in its queue:
Evaluating 0 activities based on triggers
Backups were still limping along, though. One completed at 1232, and (as usual) CTS then calculated the time window for the next backup as being between 1307 and 1337, and passed that back to DAS to dispatch. But by the time that period came round, DAS and CTS had got preoccupied with a succession of background tasks to support Safari, and Time Machine’s overdue backup remained forgotten until after 1426. By then, scheduling and dispatch had ceased altogether: DAS and CTS had been completely silent for over forty minutes.
When this happens, what I’d really like to be able to do is to restart DAS and CTS. That should enable their normal service to resume, allowing servers to continue running without interruption, for example. But these scheduling and dispatch systems are undocumented, and there are no tools available to work with or control them.
For Time Machine backups, if you can’t restart your Mac, you can revert to a LaunchAgent running
tmutil startbackup -a -r
What worries me are all the other background activities for which we rely on DAS and CTS. There are seventy or eighty other features – including software updates (and silent security patches?), iCloud, address book and other synchronisation, metadata updates for Spotlight search, security checks, and more – all of which become erratic, infrequent, and may stop altogether.
Time Machine backups are only the tip of the iceberg. When DAS and CTS go wrong, you really don’t want to leave your Mac running for much longer before restarting, even if it is a server. In fact, because it is a server.