WindowServer-GPU crash: different from a kernel panic

You’re probably familiar with most types of ‘crash’, from the unexpected quitting of apps through to kernel panics, but would you recognise a WindowServer crash? They range in severity, sometimes just freezing the display for a few seconds, but at their worst can result in unique behaviour – something I experienced for the first time a couple of days ago. This article attempts to explain what to look for, and how to respond.

The key feature here is WindowServer, a central part of macOS, which is fundamental to its GUI. Part of Core Graphics, WindowServer gathers in images of each of the windows in the system, assembles them into the composite image which will appear on the display, and passes that on to the GPU and display system. It’s also responsible for working out where on the display you have clicked or tapped, and thus which window should receive that event to process, and despatches that event to the appropriate app. Without WindowServer, there can be no GUI, and no handling of clicks/taps.

As you might then expect, when WindowServer and the GPU crash, everything on your Mac’s display freezes, including the clock. You should still be able to move the pointer about, but all windows are unresponsive, and your Mac is out of your control. In simple WindowServer crashes, the service is restarted after a few seconds, might catch up with some of the backlog of clicks/taps, and carries on.

Worse cases, in which the GPU is involved, aren’t so readily recoverable. What happens next might set you thinking that it was a kernel panic: the display goes black for a few moments. But instead of your Mac restarting as it would after a panic, you’re then invited to log in. During the black screen phase, the GPU may be restarted, and once it’s up and running it should return to normal function. Once you’ve logged back in, and provided this wasn’t the result of a hardware fault, everything should carry on as normal.

If you’re already familiar with kernel panics, this can be most confusing. Your Mac hasn’t actually restarted, but something logged you out and back in again. Could it be malware perhaps? It’s probably wisest to make sure your work is saved at that stage, close your apps down in an orderly fashion, and restart, although you could equally choose to carry on working until that’s more convenient.

The only way of confirming what happened is to browse the log – and that’s not easy. In my case, over the course of the couple of minutes in which this took place, well over 100,000 log entries were written. One way to work out what happened when is to use Ulbow’s chart feature.

The chart of all log entries over the two-minute window in which this crash took place shows an obvious double peak.

Entries from com.apple.CoreDisplay peaked earlier, though.

And there were two distinct peaks in kernel entries, one just after the barrage from com.apple.CoreDisplay, the other rather later.
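
If you don’t have Ulbow’s chart to hand, similar peaks can be found by counting entries per second. What follows is a minimal sketch in Python, assuming the log has been saved as plain text with one entry per line in the shape of the excerpts quoted in this article: decimal seconds, an optional Fault or Error level, then the process name. Real exports from Ulbow or the log command are laid out differently, so the regular expression would need adjusting, and the function names are mine, purely for illustration.

```python
import re
from collections import Counter

# Assumed line shape, modelled on the excerpts in this article:
#   "<decimal seconds> [Fault|Error] <process> <rest of entry>"
ENTRY = re.compile(r"^(\d+\.\d+)\s+(?:(Fault|Error)\s+)?(\S+)\s+(.*)$")


def parse_entry(line):
    """Return (seconds, level, process, rest), or None if the line doesn't fit."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    seconds, level, process, rest = match.groups()
    return float(seconds), level, process, rest


def entries_per_second(lines, needle=None):
    """Count entries in each whole second, optionally only those containing needle."""
    counts = Counter()
    for line in lines:
        parsed = parse_entry(line)
        if parsed is None:
            continue
        if needle is not None and needle not in line:
            continue
        counts[int(parsed[0])] += 1
    return counts


def peaks(counts, top=2):
    """Return the seconds with the most entries, busiest first."""
    return [second for second, _ in counts.most_common(top)]
```

Passing `needle="CoreDisplay"` or `needle="kernel"` to `entries_per_second` restricts the count to entries mentioning that name, which is roughly how the separate charts above were filtered.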

The first obvious sign in the log of a problem is the start of a long succession of entries such as
20.651596 Fault kernel IOAcceleratorFamily2 void IOAccelFenceMachine::fence_timeout(IOTimerEventSource *): AMDRadeonAccelerator prodding blockFenceInterrupt
which continue frequently for a period of around 5 seconds. (Times are all given in decimal seconds.)

Those are followed by many entries reporting that the GPU driver has hung:
25.566111 WindowServer SkyLight GPU Driver for display 0x042ba88c appears to be hung (5 continuous seconds of unreadiness)
appearing every 0.005 second, and progressively counting up the time.
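
Because those entries count up, the highest number reached tells you how long WindowServer considered the driver to be hung. As a rough sketch, assuming the message wording is exactly as in the entry quoted above (it may differ between macOS versions), the figure can be pulled out with a regular expression; `longest_hang` is a name I’ve made up for this example.

```python
import re

# Wording copied from the WindowServer entry quoted above; treat it as an
# assumption that other macOS versions phrase the message the same way.
HUNG = re.compile(r"appears to be hung \((\d+) continuous seconds of unreadiness\)")


def longest_hang(lines):
    """Return the largest 'continuous seconds of unreadiness' reported, or 0."""
    longest = 0
    for line in lines:
        match = HUNG.search(line)
        if match:
            longest = max(longest, int(match.group(1)))
    return longest
```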

There are further signs of big trouble in the GPU:
31.502727 Fault kernel IOAcceleratorFamily2 virtual void IOAccelEventMachineFast2::checkGPUProgress() - Signaling hardware error on channel 14..
31.502729 Fault kernel IOAcceleratorFamily2 void IOAccelEventMachine2::signalHardwareError(eRestartRequest, int32_t): GPURestartSignaled stampIdx=14 type=2 prevType=0 numStamps=22
31.502731 Fault kernel IOAcceleratorFamily2 void IOAccelEventMachine2::signalHardwareError(eRestartRequest, int32_t): GPURestartEnqueued stampIdx=14 type=2
31.502748 Fault kernel IOAcceleratorFamily2 void IOAccelEventMachine2::hardwareErrorEvent(): setting restart type to 2 (channel 14)
31.502750 Fault kernel IOAcceleratorFamily2 void IOAccelEventMachine2::hardwareErrorEvent(): GPURestartDequeued stampIdx=14 type=2
31.502752 Error kernel AMDRadeonX5000 [243:0:0]: channel 14 event timeout
31.567235 Fault kernel IOAcceleratorFamily2 AMDRadeonAccelerator: IOAccelDisplayPipe::transaction_wait_gated Timeout (1000 millisec). last VBL was 14 millisec ago. readCount=3152550, writeCount=3152552
31.567238 Fault kernel IOAcceleratorFamily2 Transaction[3152550] IOSurfaceID=15 fErrorCode=0x0, Rendering not finished

Skipping ahead 2.5 seconds, this has reached WindowServer too:
34.009642 WindowServer CoreDisplay [WARN] - Failed to receive all transaction callbacks
34.010334 Error WindowServer CoreDisplay [ERROR] - AssertTracer: 0x0000000000000002 non-survivable - isReady - AMD: AMD transaction hang or work overload (AMD IOFB)

followed by a dump of GPU information.

A little later, macOS is apparently preparing to take action by recording CoreDisplay’s status:
34.024887 WindowServer CoreDisplay [WARN] - CoreDisplay State - Begin
followed line-by-line by a property list, within which is the tell-tale
34.024890 WindowServer CoreDisplay [WARN] - <key>CrashLogMessage</key>

Soon afterwards, WindowServer tries to initiate a core dump, but is refused
34.030264 kernel AppleMobileFileIntegrity AMFI: Denying core dump for pid 345
and is abandoned, with the session being closed down and the user logged out
34.030703 VDCAssistant Window Server died attempting reconnect
34.030705 VDCAssistant Window Server connect attempt 2
34.030737 VDCAssistant Window Server Is Connected 1
34.030737 VDCAssistant Releasing our resources and shutting down the CG connection
34.030766 loginwindow -[LoginApp windowServerExited] | enter
34.030773 loginwindow -[LoginApp windowServerExited] | ERROR | Window Server exited, closing down the session immediately
34.030780 loginwindow -[SessionLogoutManager startDirectLogout:reason:] | Enter, logoutType: Logout, directLogoutReason:WindowServerExited

Individual processes are then notified of WindowServer’s death, so that they can prepare for the inevitable. For example,
34.030801 Alfred HIToolbox HIToolbox: received notification of WindowServer event port death.

Three seconds later, and nearly 12 seconds after WindowServer first reported the GPU driver as hung, there’s evidence of WindowServer restarting:
37.039503 VDCAssistant Window Server Is Connected After reset 1
37.040973 ReportCrash Parsing corpse data for process WindowServer [pid 345]

Finally the user logs back in, with the appearance of the default desktop image:
39.059044 loginsupport _LUICopyDefaultDesktopPicture: The default desktop picture was loaded
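
To pick out the same sequence from your own log, one approach is to record when each tell-tale phrase first appears and how long that was after the first of them. This is only a sketch under the same assumption as before, that each line starts with a time in decimal seconds. The marker phrases are taken from the entries quoted in this article, while the labels and the `timeline` function are my own and purely illustrative.

```python
import re

# (label, phrase) pairs; the phrases come from the entries quoted in this
# article, the labels are invented here for readability.
MARKERS = [
    ("GPU fence timeouts begin", "fence_timeout"),
    ("GPU driver reported hung", "appears to be hung"),
    ("hardware error signalled", "Signaling hardware error"),
    ("non-survivable assertion", "non-survivable"),
    ("core dump denied", "Denying core dump"),
    ("session closed down", "Window Server exited"),
    ("WindowServer reconnected", "Window Server Is Connected After reset"),
    ("desktop picture loaded", "default desktop picture was loaded"),
]

TIMESTAMP = re.compile(r"^(\d+\.\d+)\s")


def timeline(lines):
    """Return (label, time, seconds since first marker), in time order."""
    first_seen = {}
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue
        for label, phrase in MARKERS:
            if label not in first_seen and phrase in line:
                first_seen[label] = float(match.group(1))
    if not first_seen:
        return []
    start = min(first_seen.values())
    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    return [(label, when, when - start) for label, when in ordered]
```

Fed the entries quoted above, this puts WindowServer’s reconnection about 16.4 seconds after the first fence_timeout entry.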

I hope this doesn’t happen to you, but at least you now know what to look out for, and the evidence to seek in the log. I wonder if this is a relatively recent preferred alternative to a full kernel panic and restart.

In this case, although it’s much harder to see in the log, I think this was all brought about by a third-party application which had a misunderstanding with the AMD Radeon graphics card driver, which in turn brought WindowServer down.