On 15 July, Apple released the update to take Catalina to version 10.15.6, widely expected to be its last update before the arrival of Big Sur later this year. What should have been an uneventful update, fixing the last few bugs before Catalina is put on the shelf, turned into the most problematic of the 10.15 cycle. This article looks at what went wrong with it.
Although Apple has recently been improving the information which it provides about updates, its notes for 10.15.6 are brief and lack any technical detail. Specifically, they make no mention of changes in kernel extensions or graphics drivers, nor of anything else which might make the user suspect that these had altered significantly.
Security release notes list several fixes which could be related to kernel extensions, but again they lack the detail to identify which components had actually been changed in the update.
My own analysis, which was based on looking at version and build number changes in /System/Library, concluded that many new builds of kernel extensions were included in the update, and I singled out AMD, AppleIntel, and others as worthy of mention.
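A comparison of this kind can be sketched in a few lines of Python. This is only an illustration, not the tool used for the analysis above. It assumes that each kernel extension sits at the top level of the chosen folder and keeps its version in the CFBundleVersion key of Contents/Info.plist; the function names are hypothetical.

```python
import plistlib
from pathlib import Path


def snapshot_kext_versions(root):
    """Map each kext's bundle identifier to its CFBundleVersion string.

    Only looks at <root>/*.kext/Contents/Info.plist, so kexts nested
    inside other bundles are not included (an assumption of this sketch).
    """
    versions = {}
    for info in Path(root).glob("*.kext/Contents/Info.plist"):
        with open(info, "rb") as f:
            plist = plistlib.load(f)
        ident = plist.get("CFBundleIdentifier")
        if ident:
            versions[ident] = plist.get("CFBundleVersion", "")
    return versions


def changed_kexts(before, after):
    """Return {identifier: (old, new)} for kexts that are new or changed."""
    return {
        ident: (before.get(ident), new)
        for ident, new in after.items()
        if before.get(ident) != new
    }
```

Taking one snapshot of /System/Library/Extensions before an update and another afterwards, then passing both to changed_kexts, would list the extensions whose version strings differ. It says nothing about what changed inside them.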
In late July, users of VMware Fusion and VirtualBox virtualisation apps started to report kernel panics which occurred when they left VMs running for long periods. One of VMware’s excellent engineers, dariusd, investigated these, and was able to reproduce the problem. Following his work with debugging tools, he identified the problem as resulting from a memory leak in a kernel extension which had been introduced in the 10.15.6 update.
The sequence of events which was pieced together was that a very large number of Mach zone memory allocations were performed, mainly in the kalloc.32 zone, which led to zone_map exhaustion, WindowServer becoming unresponsive, and progressing in many cases to a kernel panic. Darius discovered that com.apple.security.sandbox alias Sandbox.kext was allocating millions of blocks of memory, and on 27 July he reported:
“We have narrowed down the problem to a regression in the com.apple.security.sandbox kext (or one of its related components) included in macOS 10.15.6 (19G73)”
which he reported to Apple.
By a curious coincidence, although I don’t currently have VMware installed, I suffered a very similar incident in the small hours of 6 August, which I described here. In my case, the kernel zone worst affected was kalloc.48 rather than kalloc.32. My panic log implicated com.apple.iokit.IOAcceleratorFamily2, com.apple.kext.AMDRadeonX5000, com.apple.filesystems.apfs and their dependencies.
At the time of that panic, the running Darwin Kernel Version was “19.6.0: Sun Jul 5 00:43:10 PDT 2020; root:xnu-6153.141.1~9/RELEASE_X86_64”.
Apple fixed the problem – as far as we can tell – in macOS 10.15.6 Supplemental Update, released on 12 August, and incremented build numbers on a number of kernel extensions, including AMD[n]Controller.kext, AMDRadeonX4000[*], AMDRadeonX5000[*], and AMDRadeonX6000[*] extensions. It also reverted the kernel to an older version “19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64 x86_64”. That’s still more recent than the kernel in 10.15.5, which was released on 26 May, and its Supplemental Update from 1 June.
Kernel zones now appear far more stable. Before the Supplemental Update, kalloc.48 and kalloc.64 climbed to over 400 MB and 300 MB usage after a night staying awake. Now, when in normal use, those zones stay well below 20 MB, and show no signs of persistent growth. VMware and VirtualBox users also report that their problems running VMs for long periods have vanished.
For four weeks, this severe kernel zone memory leak stopped Macs in their tracks. It’s hard to see how this bug ever made it into a beta release, let alone the full user release of 10.15.6. It doesn’t require arcane tools or special builds of the kernel to detect: simply running sudo zprint kalloc is sufficient to demonstrate the telltale, large and ever-growing usage in one or more of the kalloc.n zones. Unfortunately, Activity Monitor gives little insight into kernel zone memory usage, and suspicion was only likely to have been aroused when the zone_map was already exhausted.
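The check for persistent growth can also be sketched in Python, given zone sizes noted down from zprint at intervals. This is a rough illustration under stated assumptions: the function name, the 50 MB threshold and the sample readings are all hypothetical (the readings only echo the magnitudes mentioned above, they are not measurements). The sketch deliberately doesn't parse zprint output, whose column layout may differ between macOS versions.

```python
def growing_zones(samples, min_growth_mb=50.0):
    """Return {zone: growth in MB} for zones that never shrink between
    samples and grow overall by at least min_growth_mb.

    samples is a list of {zone name: size in MB} dicts, oldest first.
    """
    if len(samples) < 2:
        return {}
    suspects = {}
    for zone in samples[0]:
        sizes = [s[zone] for s in samples if zone in s]
        if len(sizes) < 2:
            continue
        never_shrinks = all(b >= a for a, b in zip(sizes, sizes[1:]))
        growth = sizes[-1] - sizes[0]
        if never_shrinks and growth >= min_growth_mb:
            suspects[zone] = growth
    return suspects


# Hypothetical readings in MB, for illustration only.
readings = [
    {"kalloc.48": 12.0, "kalloc.64": 9.0, "kalloc.128": 6.0},
    {"kalloc.48": 150.0, "kalloc.64": 110.0, "kalloc.128": 5.5},
    {"kalloc.48": 410.0, "kalloc.64": 305.0, "kalloc.128": 6.2},
]
print(growing_zones(readings))
```

With these made-up readings, kalloc.48 and kalloc.64 are flagged, while kalloc.128 is ignored because it falls and rises within a narrow range. A zone in normal use would be expected to behave like the latter.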
Apple’s kernel engineers have done well to deliver the fix in just a couple of weeks: thank you.