What performance to expect in the Mac Studio

Following this week’s announcement of a fourth chip in the first generation of Apple Silicon systems, we’re all full of talk about the heights of its performance. Apple also teased that it has yet to announce its replacement for the Mac Pro, which presumably will also use the new M1 Ultra. This article steps back from the hype to assess the performance already attained, and where it’s heading.

M1 family

There are now four chips in the M1 family, and according to Apple that’s the complete set:

  • M1, with 4 P and 4 E cores, 8-core GPU, and 16-core Neural Engine,
  • M1 Pro, with 8 P and 2 E cores, 16-core GPU, and 16-core Neural Engine,
  • M1 Max, with 8 P and 2 E cores, 32-core GPU, and 16-core Neural Engine,
  • M1 Ultra, with 16 P and 4 E cores, 64-core GPU, and 32-core Neural Engine.

There have been rumours of a fifth, consisting of four M1 Max chips conjoined, which could have been intended for the forthcoming Mac Pro, but it now appears most likely the replacement for Apple’s top-end model will also use the M1 Ultra – see the comments to this article for more thoughts about that.

Although there are differences in caches across the variants, their E cores are essentially the same, and the most significant difference in their P cores is that those in the original M1 chip have a slightly lower maximum frequency of 3204 MHz, while those in the M1 Pro/Max have a maximum of 3228 MHz.

CPU performance

When running tight loops of assembly code that only access in-core resources such as registers, there’s a strong linear relationship between performance, measured as the number of loops completed per second, and the number of threads run.

[Figure m1allCoresFloatThreads: loop throughput in 10^9 loops per second against number of threads, for P and E cores]

Looking first at the solid line, that’s a linear regression through the loop throughputs, measured in units of 10^9 loops per second, against the number of test threads run on P cores. That has a gradient of 0.15, indicating that each P core runs this code at a rate of 150 million loops per second. The broken line is the equivalent regression for the four E cores in an original M1 chip, each of which runs the loop at a rate of 50 million loops per second, a third of that of a P core.
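As a rough illustration of how those gradients translate into expected throughput, here’s a minimal Python sketch. The two rates are those read from the regressions; the function name, and the assumption that every thread gets a core to itself with no contention, are mine and purely illustrative.

```python
# Per-core loop rates taken from the regression gradients above,
# in units of 10^9 loops per second.
P_CORE_RATE = 0.15  # 150 million loops/s per P core
E_CORE_RATE = 0.05  # 50 million loops/s per E core in the original M1


def expected_throughput(threads: int, rate_per_core: float) -> float:
    """Predicted total throughput (10^9 loops/s) for `threads` threads.

    Assumes one thread per core and purely in-core work, so throughput
    scales linearly with the number of threads.
    """
    if threads < 0:
        raise ValueError("threads must be non-negative")
    return threads * rate_per_core


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, "P threads:", round(expected_throughput(n, P_CORE_RATE), 2))
    print("E core relative to P core:", round(E_CORE_RATE / P_CORE_RATE, 2))
```

This is only a straight-line model: it says nothing about what happens once threads outnumber cores, or when code has to reach out to memory.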

There are also two points plotted with an x which start on the regression line for the E cores, but rise sharply above it for 2 threads. Those are the results from the two E cores in an M1 Pro chip. With a single thread running on them, they follow the performance of the E cores in the M1, but loading a second thread results in a loop throughput which matches the total of all four E cores in the M1. That’s the result of core frequency control imposed by macOS.
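The arithmetic behind that observation can be sketched as follows. The only measured input is the E core rate for the original M1; the per-core figure for the M1 Pro is my inference from the statement that its two E cores together match all four in the M1, not a separately measured value.

```python
# Rate per E core in the original M1, in 10^9 loops/s (from the regression).
E_CORE_RATE_M1 = 0.05

# Total throughput of all four E cores in the original M1.
m1_e_cluster_total = 4 * E_CORE_RATE_M1

# The M1 Pro's two E cores match that total when both are loaded,
# so each must then be completing loops at this rate.
m1_pro_e_rate_two_threads = m1_e_cluster_total / 2

# Relative to the single-thread case, which sits on the M1 regression line.
per_core_speedup = m1_pro_e_rate_two_threads / E_CORE_RATE_M1

if __name__ == "__main__":
    print("M1 Pro E core rate with 2 threads:", m1_pro_e_rate_two_threads)
    print("Speed-up per core over 1 thread:", per_core_speedup)
```

If the loop rate scales with core frequency, that implies macOS roughly doubles the E core frequency in the M1 Pro when the second thread is loaded.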

For code running in sufficient threads to ensure that each core has full active residency, and whose performance isn’t limited by access to resources such as memory, we can expect a linear performance increase with increasing numbers of P cores. Effects of increasing the number of E cores are, though, largely determined by the way in which their frequency is managed.

Benchmark results

The first set of benchmarks for an M1 Ultra has been published by Juli Clover on MacRumors. These are based on the Geekbench 5 CPU suite, and indicate a single-core score of 1793, and multi-core of 24,055. My own previous tests on my M1 Pro returned a remarkably similar value of 1772 for single-core, and 12,548 multi-core, the latter being slightly more than half that of the M1 Ultra.
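To make those comparisons explicit, here’s a short Python sketch that works out the ratios from the four scores quoted above. The helper names are mine; nothing here is part of Geekbench itself.

```python
# Geekbench 5 CPU scores quoted in the text above.
M1_ULTRA = {"single": 1793, "multi": 24055}
M1_PRO = {"single": 1772, "multi": 12548}


def score_ratio(a: dict, b: dict, key: str) -> float:
    """Ratio of chip a's score to chip b's for 'single' or 'multi'."""
    return a[key] / b[key]


def multi_to_single(chip: dict) -> float:
    """Multi-core score as a multiple of the same chip's single-core score."""
    return chip["multi"] / chip["single"]


if __name__ == "__main__":
    single = score_ratio(M1_ULTRA, M1_PRO, "single")
    multi = score_ratio(M1_ULTRA, M1_PRO, "multi")
    print(f"Single-core, Ultra relative to Pro: {single:.2f}")
    print(f"Multi-core, Ultra relative to Pro: {multi:.2f}")
    print(f"Ultra multi/single: {multi_to_single(M1_ULTRA):.1f}")
    print(f"Pro multi/single: {multi_to_single(M1_PRO):.1f}")
```

The multi-to-single ratio is a crude way of expressing each multi-core score in units of one P core running flat out, which makes it easier to see how much the E cores contribute.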

These are entirely in accordance with what you’d expect for threads being run at high QoS, where they’ll be given maximum frequency on both P and E cores. What they don’t tell us about is performance of the E cores when running threads at minimum QoS, which is more typical of macOS background tasks such as Time Machine backups and Spotlight indexing services.

Benchmarking GPU and Neural Engine performance is more complicated, and access to both is normally limited to APIs such as Metal for GPUs. For developers, the joy of these chips is that this access is largely transparent and handled by macOS. That should result in a linear performance improvement with increasing numbers of cores, provided the app can use them all.

What to expect

User processes are almost exclusively run on P cores, with E cores being recruited when there’s sufficient demand. Those user processes should therefore be accelerated in proportion to the number of P cores, provided that there are sufficient threads to run. It’s that last requirement which is key: if there are 8 or fewer threads with high active residency, then the M1 Ultra’s 16 P cores will be of little or no extra benefit. Only when the number of heavyweight threads exceeds 8 will those extra cores result in improved performance over the M1 Pro/Max.
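That threshold can be expressed as a simple model. The Python sketch below is an idealisation under my own assumptions: each heavyweight thread keeps one P core fully busy, E cores are ignored, and nothing else such as memory limits throughput. The function names are hypothetical.

```python
def busy_p_cores(heavy_threads: int, p_cores: int) -> int:
    """Number of P cores that can be kept fully occupied."""
    return min(heavy_threads, p_cores)


def ultra_speedup(heavy_threads: int, ultra_p: int = 16, max_p: int = 8) -> float:
    """Idealised speed-up of an M1 Ultra over an M1 Pro/Max.

    Assumes one heavyweight thread per P core and ignores E cores.
    """
    if heavy_threads < 1:
        raise ValueError("need at least one heavyweight thread")
    return busy_p_cores(heavy_threads, ultra_p) / busy_p_cores(heavy_threads, max_p)


if __name__ == "__main__":
    for threads in (4, 8, 12, 16, 32):
        print(threads, "threads:", ultra_speedup(threads))
```

With 8 or fewer threads the model returns 1.0, no gain at all; the benefit then climbs until all 16 P cores are occupied, where it tops out at 2.0.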

This is likely to be reinforced by macOS’s management of cores, which are grouped into clusters, typically of four cores (two in the case of E cores in the M1 Pro/Max). When running 4 or fewer heavyweight threads, only the first P cluster (P0) will be active; with 5-8 threads, the second P cluster (P1) will be added; the Ultra’s P2 and P3 clusters will normally remain inactive, fully idle at a frequency of 600 MHz, until 9 or more threads are fully active.
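Here’s a minimal Python sketch of that recruitment pattern. It assumes clusters are brought in strictly one at a time as each fills with four heavyweight threads; that’s my simplification of the behaviour described, not a documented macOS policy, and the function name is hypothetical.

```python
import math
from typing import List


def active_p_clusters(heavy_threads: int, total_clusters: int,
                      cluster_size: int = 4) -> List[str]:
    """Names of the P clusters expected to be active.

    Assumes clusters fill in order (P0, P1, ...) with one heavyweight
    thread per core, and that a new cluster is recruited only when the
    previous ones are full.
    """
    if heavy_threads < 1:
        return []
    needed = math.ceil(heavy_threads / cluster_size)
    return [f"P{i}" for i in range(min(needed, total_clusters))]


if __name__ == "__main__":
    # M1 Pro/Max have two P clusters, M1 Ultra has four.
    for threads in (3, 6, 9, 16):
        print(threads, "threads:",
              "Pro/Max", active_p_clusters(threads, 2),
              "Ultra", active_p_clusters(threads, 4))
```

On this model, an M1 Ultra running 8 heavyweight threads looks exactly like an M1 Max, with half its P cores left idle.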

Where the M1 Ultra may prove of little advantage is in macOS background tasks, which aren’t just configured to run on E cores alone but also usually have I/O throttling applied. Unless that throttling is eased and E cores are run at higher frequencies, tasks such as Time Machine backups and Spotlight indexing are likely to take as long on a Mac Studio equipped with an M1 Ultra as on an M1 Max, or even on an original M1.

Glossary

Active residency is the proportion (usually percentage) of clock cycles in which a core is actively processing, and not idling.

E core is an Efficiency core (Icestorm), designed for low power consumption while still delivering useful performance.

P core is a Performance core (Firestorm), designed for high performance at higher power consumption.

QoS is Quality of Service, a setting used in macOS to determine both priority and core allocation of threads. The lowest, background, with a numeric value of 9, results in the thread being run exclusively on E cores; three higher values result in the thread being run preferentially on P cores, although it can be run on E cores when all P cores are already at high active residency.
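The allocation rule in that definition can be sketched in Python as below. This is a toy model of the rule as stated here, not of the macOS scheduler; the function name and the boolean standing in for “all P cores at high active residency” are my own inventions.

```python
from typing import List

BACKGROUND_QOS = 9  # lowest QoS value, as given above


def eligible_core_types(qos: int, p_cores_saturated: bool) -> List[str]:
    """Core types a thread may run on, following the rule described above.

    qos: numeric QoS of the thread (9 is background; anything higher is
         treated as one of the higher QoS values).
    p_cores_saturated: True when all P cores are already at high
         active residency.
    """
    if qos <= BACKGROUND_QOS:
        return ["E"]        # background threads stay on E cores
    if p_cores_saturated:
        return ["P", "E"]   # spill over to E cores when P cores are full
    return ["P"]            # otherwise run on P cores


if __name__ == "__main__":
    print(eligible_core_types(9, False))
    print(eligible_core_types(25, False))
    print(eligible_core_types(25, True))
```

The asymmetry is the point: higher QoS threads can overflow onto E cores, but background threads are never promoted to P cores, however idle those are.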

Revised following the comments below, for which I’m very grateful, and updated 1800 GMT 10 March 2022.