Panics, Pegasus, and Recovery

My iMac Pro running Mojave 10.14.1 just had two kernel panics, the first resulting in an auto-restart, the second freezing shutdown. Not only do I think I know why, but in dealing with them some interesting changes have come to light. If you back up using Time Machine in High Sierra or Mojave, the next time you run First Aid on your boot volume in Recovery Mode, be prepared for a very long wait.

My Mac had spent much of the afternoon asleep. I got home, woke it up, and started working through recent tweets. I was just replying to one when its display went ominously black.

There are several things that a suddenly black display can indicate. Only ten days old, my iMac Pro could have suffered a severe hardware failure, and effectively be dead. Unfortunately, macOS doesn’t give any indication that a Mac is merely rebooting after a severe kernel panic: you just have to wait and hope that the Apple logo will reappear. It did.

When it had restarted, everything looked and worked fine apart from Time Machine. It got stuck in what appeared to be an endless backup. My next stop was to look at the backup folder, on my old Promise Pegasus R4 RAID array. That came as something of a shock: viewed in the Finder, my backup folder had vanished. I decided to restart, and see if it mounted properly again and might kickstart the next backup. It was then that my Mac hung, trying to restart. All I could do was to force shutdown using the Power button.

Now I was concerned. That looked like (and was) two kernel panics in quick succession. I connected my wireless keyboard and trackpad using the charging leads and decided to start up in hardware diagnostics, holding the D key.

On the iMac Pro, diagnostics is surprisingly bare-boned, just a little progress bar and terse text display. Just over four minutes later, it reported that my hardware was healthy, and offered to check out my available support options. I declined those, and restarted into Recovery Mode (Command-R) to run Disk Utility on each available drive.

This is the first time that I have run First Aid in Disk Utility from Recovery Mode on an APFS boot disk which has been making hourly snapshots as part of Time Machine backups. It brought a surprise: when running First Aid, every one of the eighteen snapshots on my startup volume was checked, very slowly indeed. Although that is good First Aid, and potentially very important, it meant that checking my 1 TB internal SSD, which is only half full, took well over half an hour.

DiskUtilRecoveryMode

The startup disk was fine, so I then checked my 1 TB external SSD, dreading to think how long verifying its eighteen snapshots would take. Strangely, given that it is 70% used so those snapshots should have been significantly larger, each took only a few seconds to check.

I now knew that my Mac and its two SSDs seemed healthy, so it was time to restart in normal mode, and see if it would mount my external RAID. It didn’t, but at least I could now sort that out while I continued with my work. I had to shut the Pegasus down using its Power button, leave it a few seconds, and start it up again to get it to mount. Once fully available, its backup folder Backups.backupdb appeared intact and just as I had left it earlier in the day.

DiskUtilRAID

I then set Disk Utility running First Aid on the RAID volume, to ensure that its HFS+ file system hadn’t been damaged. It eventually reported that it too was fine.

While that was going on, I recovered the crash reports for the two kernel panics from /Library/Logs/DiagnosticReports/ProxiedDevice-Bridge, where they had been moved into the Retired folder as I had opted to send both to Apple. They pointed the finger at the kernel extension com.promise.driver.stex, which provides support for my old Pegasus R4 RAID. I checked that I was running the latest version (6.2.9), finding none more recent. The Panic String starts:
panic(cpu 6 caller 0xffffff801d4c306f): UUID: C592A1B0-0777-44E2-8305-21662CB6EB42
Stackshot Reason: Wake transition timed out after 180 seconds while calling power state change callbacks. Suspected bundle: com.promise.driver.stex.
Kernel Extensions in backtrace:
com.promise.driver.stex(6.2.9)

So now it’s time to reconstruct what had happened, on the evidence available.

What seems to have led to the kernel panic was waking from sleep, which unmasked a bug/conflict in com.promise.driver.stex version 6.2.9 which hasn’t reared its head during normal use. In the automatic restart after the first panic, the Pegasus RAID array was left in a crippled state, as if erased. When Time Machine then tried to run its next backup, it couldn’t find the backup folder, and just hung, reporting
Not starting scheduled Time Machine backup: No destinations resolvable

With the new implementation of Time Machine in High Sierra and Mojave, although it couldn’t proceed with the automatic backup, it was still able to make its local snapshots, providing a valuable fallback in situations like this.

The lessons are:

  • Running First Aid in Disk Utility in Recovery Mode on a startup volume with hourly snapshots takes a very long time.
  • Until Promise fixes this probable bug, I have disabled system sleep. The last desktop Mac which I had which could be trusted to sleep and wake reliably was my Mac Pro 8-core in 2007.
  • Console is utterly useless for browsing logs in this situation, as it can’t browse back in time without making a logarchive.
  • How important tools like T2M2 and Consolation are when things go wrong. I need to get on with developing Consommé, my next-generation log browser.
  • Don’t underestimate the value of Time Machine’s hourly snapshots: in Sierra, Time Machine wouldn’t have made any form of backup here for over four hours.