Apple silicon: 1 Cores, clusters and performance

Over the last three years, Apple silicon Macs have acquired quite a reputation, being magic for some, while others seem only to find faults. This new series attempts to explain some of their more controversial aspects, from performance and power use, through temperature to energy efficiency.

Chips

M-series chips are a radical departure from Intel Macs, which needed a chipset to perform the functions now contained in a single system-on-a-chip (SoC). Instead of a Mac having a separate processor, memory, graphics card and chips to manage its input and output to network and peripherals, these are all integrated into a single chip. With that integration come two further departures: there are two types of CPU cores; and memory for the CPU cores, the graphics processor (GPU) and other units is shared by them all, rather than being separate – Unified memory.
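One practical consequence of Unified memory is that the CPU and GPU can work on the same bytes without copying them between separate memories. Here's a minimal Swift sketch using Metal to illustrate the idea (it's only an illustration, not anything measured for this article):

import Metal

// With Unified memory, a Metal buffer in shared storage is visible to both
// the CPU and the GPU, so nothing needs copying between 'main' and 'video' memory.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("no Metal device found")
}

let count = 1_000_000
guard let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                                     options: .storageModeShared) else {
    fatalError("couldn't allocate a shared buffer")
}

// The CPU writes directly into the buffer; a GPU compute pass could then
// read the same values without any blit or upload step.
let values = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count { values[i] = Float(i) }
print("shared buffer of \(buffer.length) bytes, first value \(values[0])")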

Starting with the CPU, there are expected to be four distinct chips in the M3 family: base M3, M3 Pro, M3 Max, and the unannounced M3 Ultra. Earlier M1 and M2 families differ in their core numbers, but the M3 is the first to make such a clear distinction between Pro and Max.

Core types

Every M-series chip contains two different types of core, Performance (P) and Efficiency (E), although the numbers of each vary according to which family they’re in, and which model they are. I’ll start with the base version, which is the most consistent across the three families as it has 4 P and 4 E cores in each case.

[Image: M3base]

This is very different from the design of Intel CPUs used in Macs, whose cores are all identical. The aim of using two different core types is to better match performance and power use. Threads – the common chunks of code that are allocated to run on cores – don’t all have the same priority or performance requirements. Background tasks, like building Spotlight’s indexes, aren’t time-critical, so can be run on cores that are slower and use less power; user tasks, like handling controls in a window, need to be run as quickly as possible even if that means they aren’t as energy-efficient.

Core use

In some models of this type of CPU, such as Intel’s, the processor decides which type of core to run tasks on, according to a set of rules. In Apple silicon there’s greater flexibility, as it’s macOS that makes that decision on the basis of priorities allocated to threads, a setting known as their Quality of Service (QoS). Threads assigned a low QoS are almost invariably run on E cores, while those of higher QoS are preferentially allocated to P cores, but can still be run on E cores when needed.
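From a developer's point of view, QoS is simply declared when work is submitted, and macOS then chooses the cores. Here's a minimal Swift sketch, with trivial stand-ins for real background and user-facing work:

import Dispatch
import Foundation

// Hypothetical stand-ins for real work: a long maintenance task and a short
// user-facing one.
func rebuildIndex() { Thread.sleep(forTimeInterval: 2) }
func renderPreview() { Thread.sleep(forTimeInterval: 0.1) }

let group = DispatchGroup()

// Low QoS: background work such as indexing or maintenance.
// macOS normally confines threads like this to the E cluster.
DispatchQueue.global(qos: .background).async(group: group) {
    rebuildIndex()
    print("background work finished")
}

// High QoS: work the user is waiting on right now.
// macOS runs this preferentially on P cores, at their maximum frequency.
DispatchQueue.global(qos: .userInitiated).async(group: group) {
    renderPreview()
    print("user-facing work finished")
}

group.wait()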

The best way to see how this works is with an example.

A few seconds ago, Time Machine started making one of its automatic hourly backups, and at the moment is running two main threads. Because those are designated background processes with a low QoS, macOS has allocated them to the four cores in the E cluster. Next, a new message arrives in Messages, playing the normal jingle and posting a notification. You bring the Messages app to the front and start typing your reply. Although some of the threads involved with that may have low QoS, most of them are user-facing, so have high QoS. macOS thus allocates them to the four cores in the P0 cluster, where they’re run at maximum speed to ensure that your Mac feels responsive, and completes the tasks as quickly as it can.

There should be ample processing power available in the two clusters to cope with those straightforward tasks. What happens, though, if you’re also in the middle of encoding some video, while editing some high-resolution images, when a new message arrives? User tasks here may have almost fully occupied the four P cores in the P0 cluster. macOS then becomes more flexible in allocating threads: when there’s no P core to accept a new high QoS thread, it will allocate that to an E core instead, as overflow. To ensure that thread still gets processed as quickly as possible, the E cores are run at higher frequency, so they perform almost as well as P cores.
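To see that overflow for yourself, you could deliberately submit more high QoS work than a 4-core P cluster can run at once; here's a rough Swift sketch, using a busy loop as a stand-in for encoding or editing:

import Dispatch
import Foundation

// Stand-in for CPU-heavy work such as video encoding or image editing.
func heavyWork(_ index: Int) {
    var total = 0.0
    for i in 1...20_000_000 { total += sin(Double(i)) }
    print("work item \(index) finished (\(total))")
}

let group = DispatchGroup()

// On a base M3 (4P+4E), eight simultaneous high QoS items are more than the
// P cluster alone can take, so macOS can spill some onto the E cluster,
// which is then run at higher frequency to keep them moving.
for i in 0..<8 {
    DispatchQueue.global(qos: .userInitiated).async(group: group) {
        heavyWork(i)
    }
}
group.wait()

Watching the CPU History window in Activity Monitor while something like this runs shows all the cores busy, with the E cluster picking up the excess.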

Variable frequency

This is another key feature of Apple silicon cores: unlike most older CPUs, the cores can be run at different frequencies or clock speeds, ranging from 696 MHz up to 4,056 MHz for M3 P cores. Although I don’t know whether M3 cores take advantage of this, it’s also possible that their voltage is variable.

Much of the time, E cores in an M3 chip run background threads at their minimum frequency of only 744 MHz, but when they're used to run threads that have overflowed from P cores, their frequency can rise to its maximum of 2,748 MHz. Overflowed threads on E cores thus can't benefit from the same high performance as they would on P cores, but they do run much faster than background threads.

Clusters

Cores of the same type are grouped into clusters. Within each cluster, all the cores run at the same frequency, and they share local memory in their level 2 (L2) cache. That also makes it easier for threads to be moved between cores in the same cluster. In M1 and M2 family chips, E and P cores are arranged in clusters of up to four cores; Apple changed that in the M3, whose clusters can contain as many as 6 cores, all of the same type. In all base version CPUs of M-series chips, there's one cluster of 4 E cores and one cluster of 4 P cores, so the CPU can be designated as 4P+4E. Pro, Max and Ultra versions have different numbers of cores, and may have two (or more) clusters of P cores in all.
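macOS describes this arrangement through sysctl, with each core type reported as a 'perflevel'. A short Swift sketch (assuming the hw.perflevel names used by recent macOS, where perflevel0 is the P core type and perflevel1 the E core type):

import Foundation

// Read an integer sysctl value by name; returns nil if it isn't available.
func sysctlInt(_ name: String) -> Int? {
    var value: Int64 = 0
    var size = MemoryLayout<Int64>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return Int(value)
}

let pCores = sysctlInt("hw.perflevel0.logicalcpu") ?? 0
let eCores = sysctlInt("hw.perflevel1.logicalcpu") ?? 0
let pL2 = sysctlInt("hw.perflevel0.l2cachesize") ?? 0
let eL2 = sysctlInt("hw.perflevel1.l2cachesize") ?? 0

print("\(pCores)P+\(eCores)E, P cluster L2 \(pL2 / 1_048_576) MB, E cluster L2 \(eL2 / 1_048_576) MB")

On a base M3 that should report 4P+4E, together with the size of each cluster's L2 cache.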

[Image: M3pro]

The M3 Pro is the only CPU with a full 6 E cores in a single cluster, together with 6 P cores, making it effectively 1.5 times the base M3, at 6P+6E. This increases its capacity for both background and high QoS threads. As it’s unusual for a 4-core E cluster to be fully occupied with background threads, a major role of those additional E cores is to provide for overflow from its P cores.

[Image: M3max]

The M3 Max currently has the greatest CPU capacity of the M3 family, with its two 6-core clusters of P cores, but only 4 E cores, as 12P+4E, giving it less capacity for coping with overflow.

The M3 Pro and Max are also available in ‘binned’ versions with fewer P cores. Chip fabrication inevitably produces some chips in which not all the cores perform as expected; M3 Pro chips with only 5 P cores (5P+6E), and M3 Max chips with only 10 P cores (10P+4E), still have the normal number of cores present, but one or two of their P cores have been intentionally disabled because they don’t function as expected.

Benchmarks

Comparing multi-core performance across the three chips currently in the M3 family is thus fraught with difficulty. On a base M3, a multi-core benchmark is run on all 8 cores, but only half of them are intended to be high-performance. On an M3 Pro, the ratio is the same over 12 cores, while an M3 Max has twice as many cores as a base M3, but in a ratio of 3:1 P:E rather than 1:1, giving it a significant advantage in such benchmarks. In normal use, without high QoS threads overflowing to E cores, the M3 Pro is intended to handle 1.5 times the load of a base model, and the M3 Max 3 times the base. Multi-core benchmark results obscure those important subtleties.

Comparing single-core performance between the two types of core in an M-series chip is complex, because of the large difference in their operating frequencies. Threads on E cores at low QoS are normally run at a frequency of only 744 MHz, while those run on P cores can attain 4,056 MHz, well over five times that speed. Assessments of the M1 P and E cores suggest that its E cores have roughly half the computational capacity of its P cores, but I’ve not seen comparable work for M3 cores. In-core performance measurements suggest that M3 E cores are a closer match to its P cores at the same frequency. If that’s a reasonable approximation, then the best performance to be expected of an M3 E core, at its maximum of 2,748 MHz, is around 70% that of a P core at 4,056 MHz (2,748/4,056 ≈ 0.68).

Performance benchmarks also generally omit information about power, the central topic in the next article in this series. As an introduction to that, let me give you some figures for the approximate power drawn by different core types. These aren’t power at the wall socket, but measured for the CPU itself.

When running fully loaded, with each core at 100% CPU, a six-core P cluster uses between 5 and 13 W, depending on the type of task. A six-core E cluster uses less than 0.2 W when running low QoS threads at low frequency, and around 1.2 W when running overflow high QoS threads at maximum frequency. Those are huge differences, and they illustrate the potential benefits of the two core types.

Concepts

  • Apple silicon chips integrate functions from a whole chipset, featuring two CPU core types and Unified memory.
  • The two core types are specialised for Efficiency (E) and Performance (P), and offered in different numbers and ratios across each family.
  • Background, low QoS, threads are run on E cores to minimise energy use; user-facing, high QoS, threads are run preferentially on P cores, but can overflow onto E cores.
  • Cores are run at variable frequency (and perhaps voltage), with E cores at low frequency for background threads, and high frequency for overflow high QoS threads.
  • Cores are grouped into single-type clusters running at the same frequency and sharing L2 cache. Maximum cluster size is 4 in M1 and M2 chips, increased to 6 in the M3.
  • Multi-core benchmarks are confounded by the number and ratio of core types.
  • Power used by a 6-core E cluster is less than 0.2 W at low frequency, while a 6-core P cluster can use 5-13 W depending on the task.

Further reading

Evaluating M3 Pro CPU cores: 1 General performance
Evaluating M3 Pro CPU cores: 2 Power and energy
Evaluating M3 Pro CPU cores: 3 Special CPU modes
Evaluating M3 Pro CPU cores: 4 Vector processing in NEON
Evaluating M3 Pro CPU cores: 5 Quest for the AMX
Evaluating the M3 Pro: Summary
Finding and evaluating AMX co-processors in Apple silicon chips
Comparing Accelerate performance on Apple silicon and Intel cores
M3 CPU cores have become more versatile