Last Week on My Mac: On M1 chips 1 + 1 = 4

Over the last few weeks, I’ve been looking at how macOS manages the use of the Efficiency (E, Icestorm) cores in both the original M1 and new M1 Pro chips. Because the older chip has a cluster of four E cores, but the newer chip only two, at first sight Apple appears to have halved the capacity of its more powerful Apple Silicon Macs to run all the background tasks of macOS. But when those cores are loaded with test code, the two designs perform differently, as this article explains.

First, let’s look at this using a simple test written in assembly language, which performs a very large number of floating-point calculations, sufficient to fully load an E or P core for a few seconds. As well as timing how long each test takes to complete, I’ll show here what you observe in Activity Monitor’s CPU History window.

m1miniQoS9_1-8mixed

In this sequence of tests on my M1 Mac mini, I increased the number of identical test processes from 1 at the left to 4, 6 and 8 at the right. Over the first four tests, the height of the bars increases until they reach 100% on each of the four E cores, but the width of each test (time taken) remains constant at around 5-6 squares until the number of processes exceeds the number of E cores. There’s essentially no activity on the P cores, though, as these test processes are given a ‘quality of service’ (QoS) which confines them to the E cores alone.

m1proEcores1-4neon

Here’s a similar sequence of 1-4 processes on the M1 Pro. Although the CPU % for the single process (left) was given as 100%, it was shared evenly over the two E cores, and the eight P cores remained near-idle throughout. There’s one tantalising detail which you might easily miss: the width (duration) of the second test with 2 processes is obviously smaller than the first test with a single process. Similarly, the peak of the third test with 3 processes is narrower than the first.

My next step was to quantify this effect. Here I show results from just one of the six tests used, ranging from integer arithmetic to vector processing, which all show the same phenomenon.

m1allCoresFloat

Looking first at the solid line, that’s a linear regression through the loop throughputs, measured in billions (10^9) of loops per second, against the number of processes run on P cores (in both M1 and M1 Pro chips). That has a gradient of about 0.15, indicating that each P core runs this floating-point maths code at a rate of 150 million loops per second. The broken line is the equivalent regression for the four E cores in an original M1 chip, each of which runs the loop at a rate of 50 million loops per second, a third of the rate of a P core.

Between those well-fitting regression lines is one point, marked with an x, which is anomalous. That’s the result for running two processes on the two E cores in an M1 Pro, which runs at about twice the speed of two processes on the four E cores of the original M1, and very close to the speed reached by all four of that chip’s E core cluster.
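To make the size of that anomaly concrete, here is a minimal sketch in Python that uses only the two regression gradients quoted above. It assumes both lines pass through the origin, and it takes "about twice the speed" as exactly double purely for illustration; the function and variable names are my own and aren't part of the test suite.

```python
# Gradients read from the regression lines described above, in billion
# loops per second per process (one process per core).
P_CORE_RATE = 0.15  # about 150 million loops/s per P core
E_CORE_RATE = 0.05  # about 50 million loops/s per E core, original M1


def predicted_throughput(processes: int, rate_per_core: float) -> float:
    """Throughput on a regression line through the origin, in billion loops/s."""
    return processes * rate_per_core


two_e_original = predicted_throughput(2, E_CORE_RATE)
four_e_original = predicted_throughput(4, E_CORE_RATE)
one_p = predicted_throughput(1, P_CORE_RATE)

# Assumption: "about twice the speed" is treated as exactly 2x here.
two_e_m1_pro = 2 * two_e_original

print(f"2 E cores, original M1: {two_e_original:.2f} billion loops/s")
print(f"4 E cores, original M1: {four_e_original:.2f} billion loops/s")
print(f"2 E cores, M1 Pro:      {two_e_m1_pro:.2f} billion loops/s")
print(f"1 P core:               {one_p:.2f} billion loops/s")
```

On those assumptions, the anomalous point sits level with the four-core value on the broken line, and above the value for a single P core on the solid line.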

All other factors being equal, the two E cores of the M1 Pro can deliver the same performance as the four of the original M1.

The most obvious explanation is that the E cores run at a higher clock speed to generate this increased performance. But that speed would need to have doubled: if the starting speed was close to the maximum available for the E cores, of 2064 MHz, then both E cores would have to run at close to 4000 MHz. Even the P cores are limited to 3228 MHz, so that doesn’t appear feasible.
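That feasibility argument is simple arithmetic, sketched below in Python. It assumes that throughput scales linearly with clock speed, so doubling performance needs double the frequency; the helper names are hypothetical, and the only figures used are the two maximum clock speeds quoted above.

```python
E_CORE_MAX_MHZ = 2064  # maximum E core clock speed quoted above
P_CORE_MAX_MHZ = 3228  # maximum P core clock speed quoted above


def doubled_clock(start_mhz: float) -> float:
    """Clock speed needed to double throughput, assuming linear scaling."""
    return 2 * start_mhz


def can_double(start_mhz: float, limit_mhz: float = E_CORE_MAX_MHZ) -> bool:
    """True if doubling the starting clock stays within the given limit."""
    return doubled_clock(start_mhz) <= limit_mhz


# Starting from the E core maximum, doubling overshoots even the P core limit.
needed = doubled_clock(E_CORE_MAX_MHZ)
print(f"Doubling {E_CORE_MAX_MHZ} MHz needs {needed} MHz")
print(f"Within E core limit: {can_double(E_CORE_MAX_MHZ)}")
print(f"Within P core limit: {can_double(E_CORE_MAX_MHZ, P_CORE_MAX_MHZ)}")

# Highest starting clock from which a doubling is still possible.
highest_start = E_CORE_MAX_MHZ / 2
print(f"Doubling is only possible from {highest_start:.0f} MHz or below")
```

So a doubling in clock speed is only possible if the E cores start out at no more than about half their maximum frequency.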

I therefore measured each core’s clock speed while running the tests, using the powermetrics command tool. I’ll be publishing full results from those tests tomorrow, but can confirm that this anomaly results from differences in clock speed which remain within the expected range. This is because macOS runs processes on M1 Pro E cores differently from original M1 cores.

On an original M1, with its four E cores, low QoS ‘background’ processes run with the core clock speed at around 1000 MHz, regardless of the percentage active residency of each core. When running one test process, total active residency is 100%, the equivalent of one of those cores being fully active. As the number of processes increases, each adds another 100% until, with four processes, the four cores are each at 100% but the clock speed remains at around 1000 MHz. This is highly energy-efficient: a single E core running at that speed uses around 30-50 mW, and all four use around 200 mW.

The two E cores on an M1 Pro are managed differently. One process runs with the same settings as on the original M1, at a clock speed of around 1000 MHz. But when a second process is added, resulting in 200% active residency, the clock speed is doubled to nearly 2000 MHz. The effective throughput then matches that of the original M1’s four E cores still running at 1000 MHz, with a similar power consumption of almost 200 mW.
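The equivalence between the two clusters can be sketched with a simple model in which throughput is proportional to the number of fully active cores multiplied by their clock speed. That proportionality is my assumption for this compute-bound loop, not something macOS reports. The calibration of 50 million loops per second for one E core at 1000 MHz comes from the regression above, "nearly 2000 MHz" is rounded to 2000, and the class and function names are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """E cluster running one fully loaded process per active core."""
    name: str
    active_cores: int
    clock_mhz: float


# Assumed calibration: one E core at about 1000 MHz runs about
# 50 million loops/s, giving loops per second for each MHz of clock.
LOOPS_PER_SECOND_PER_MHZ = 50e6 / 1000


def throughput(state: ClusterState) -> float:
    """Modelled loops per second, assuming linear scaling with clock speed."""
    return state.active_cores * state.clock_mhz * LOOPS_PER_SECOND_PER_MHZ


m1_four = ClusterState("original M1, 4 processes", 4, 1000)
pro_one = ClusterState("M1 Pro, 1 process", 1, 1000)
pro_two = ClusterState("M1 Pro, 2 processes", 2, 2000)  # clock doubled

for state in (m1_four, pro_one, pro_two):
    print(f"{state.name}: {throughput(state) / 1e6:.0f} million loops/s")
```

In this model, two cores at 2000 MHz and four cores at 1000 MHz give the same product of cores and clock speed, which is why their throughputs match.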

Refer back to the graph above to compare the M1 Pro’s E core performance with that of a single P core. With its two E cores running at almost 2000 MHz and a power of under 200 mW, it outperforms a single P core, which is running at 3228 MHz and just over 1 W in power. The E cores are doing significantly more on less than a fifth of the power.
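One way to express that comparison is as work done per unit of energy. The Python sketch below uses rounded figures as assumptions: 0.20 billion loops per second for the two E cores (matching the four-core value on the regression line), 0.15 billion for a single P core, and powers of 0.2 W and 1.0 W standing in for "under 200 mW" and "just over 1 W". The resulting ratio is only indicative.

```python
def loops_per_joule(loops_per_second: float, watts: float) -> float:
    """Work done per unit of energy: (loops/s) / (J/s) = loops/J."""
    return loops_per_second / watts


# Rounded figures taken from the text; all four are approximations.
E_PAIR_LOOPS_PER_S = 0.20e9  # two M1 Pro E cores at almost 2000 MHz
E_PAIR_WATTS = 0.2           # "under 200 mW"
P_CORE_LOOPS_PER_S = 0.15e9  # one P core, from the regression gradient
P_CORE_WATTS = 1.0           # "just over 1 W"

e_efficiency = loops_per_joule(E_PAIR_LOOPS_PER_S, E_PAIR_WATTS)
p_efficiency = loops_per_joule(P_CORE_LOOPS_PER_S, P_CORE_WATTS)
ratio = e_efficiency / p_efficiency

print(f"Two E cores: {e_efficiency / 1e9:.2f} billion loops per joule")
print(f"One P core:  {p_efficiency / 1e9:.2f} billion loops per joule")
print(f"E pair does about {ratio:.1f} times the work per joule")
```

Because the E cores both run faster than the single P core here and draw less power, their advantage in work per joule is larger than the power ratio alone would suggest.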

The other situation in which test processes are run on the E cores is when they overflow from the P cores at high QoS. Load the chip with ten processes, and all ten cores reach 100% active residency. Once again, the E cores are running at high frequency, at their maximum clock speed of 2064 MHz, but the two are only drawing a total of 205 mW power. Each of the two clusters of P cores, running at their maximum clock speed of 3228 MHz, is using 4 W of power.

Those preliminary results are given in the summary table below.

ecorepmetrics1

So why does macOS control the E cores differently on the original M1 and the new M1 Pro?

Heavyweight background processes, such as post-boot MRT scans, Spotlight metadata store maintenance, and Time Machine backups, can result in 100% active residency of all the E cores. If the two in the M1 Pro were to continue to run at 1000 MHz, those tasks would take twice as long as they do on an original M1 chip. Although most users wouldn’t notice, as they interact almost entirely with processes running on the P cores, it wouldn’t be good for those tasks to take twice as long on the latest, fastest and most expensive of the M1 models. Yet because of the extreme efficiency of E cores, running the two at the higher speed costs no more in power than the four E cores of the original M1. Using just one P core for the same task would be slower than the two E cores running at 2000 MHz, and would use around five times the power.

That is one of the secrets behind the M1 Pro/Max design: macOS uses those two E cores even more efficiently. Tomorrow, I’ll go into greater detail and show fuller results.