In conventional multi-core processors, like the Intel CPUs used in previous Mac models, all cores are the same. Allocating threads to cores is therefore a matter of balancing their load, in what’s termed symmetric multiprocessing (SMP).
In Activity Monitor’s CPU History window, core load (as CPU %) is shown against time, with the oldest values at the left. Odd-numbered cores in the left half are real, and show the eight cores in the 8-Core Intel Xeon W under heavy load. Even-numbered cores in the right half are the virtual cores of Hyper-Threading, engaged to cope with the heaviest load.
CPUs in Apple Silicon chips are different, as they contain two different core types, one designed for high performance (Performance, P or Firestorm cores), the other for energy efficiency (Efficiency, E or Icestorm cores). For these to work well, threads need to be allocated by core type, a task which can be left to apps and processes, as it is in Asahi Linux, or managed by the operating system, as it is in macOS. This article explains how macOS manages core allocation in all Apple’s M1 series chips, in what it terms asymmetric multiprocessing (AMP, although others prefer to call this heterogeneous computing).
Architecture
There are two types of CPU core in M1 series chips:
- E cores contain roughly half the internal processing units of P cores, and have a maximum frequency of 2064 MHz.
- P cores have a higher maximum frequency, of either 3204 MHz in the original M1, or 3228 MHz in M1 Pro/Max/Ultra.
There are three configurations of CPU cores available in M1 series chips:
- the original M1, with 4 E and 4 P cores, in the MacBook Air, MacBook Pro 13-inch, iMac and Mac mini;
- M1 Pro and Max, with 2 E and 8 P cores, in the MacBook Pro 14- and 16-inch, and Mac Studio Max;
- M1 Ultra, with 4 E and 16 P cores, in the Mac Studio Ultra.
Some MacBook Pro 14-inch notebooks have a reduced M1 Pro chip with only 6 P cores instead of 8.
To simplify the management of cores, macOS divides them functionally into clusters of 2-4 cores of the same type. Unfortunately, the numbering of cores at system level, as shown by tools such as powermetrics, differs from that displayed in Activity Monitor. For consistency with the latter, I here follow its core numbering, but number clusters in accordance with the system. The three chips have the following functional clusters as of macOS Monterey 12.3.1:
- the original M1 has one cluster of each type of core, E0 and P0, each containing 4 cores of the same type;
- M1 Pro and Max have one cluster of 2 E cores (E0), and two clusters each containing 4 P cores (P0, P1);
- M1 Ultra has one cluster of 4 E cores (E0), and four clusters each containing 4 P cores (P0, P1, P2, P3).
All cores within any given cluster are run at the same frequency, and generally (but not always) have their load balanced within the cluster. There are occasions when load is distributed more unevenly, and in exceptional cases, certain threads may be allocated to only one core within a cluster.
Thread control
Unlike Asahi Linux, macOS doesn’t provide direct access to cores, core types, or clusters, at least not in public APIs. Instead, these are normally managed through Grand Central Dispatch using Quality of Service (QoS) settings, which macOS then uses to determine thread management policies.
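As a minimal sketch of how an app can express this, assuming nothing beyond Grand Central Dispatch and a made-up busy-loop function to make the threads visible in Activity Monitor:

```swift
import Dispatch

// Purely illustrative CPU load, so each thread shows up in the CPU History window.
func burnCPU(_ iterations: Int) -> Double {
    var total = 0.0
    for i in 1...iterations {
        total += Double(i).squareRoot()
    }
    return total
}

// Lowest QoS (Background, QoS 9): macOS will run this thread on the E cluster only.
DispatchQueue.global(qos: .background).async {
    print(burnCPU(500_000_000))
}

// Higher QoS (User Initiated): eligible for either core type, P clusters preferred.
DispatchQueue.global(qos: .userInitiated).async {
    print(burnCPU(500_000_000))
}

// Park the main thread so this command-line demo keeps running.
dispatchMain()
```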
Threads with the lowest QoS will only be run on the E cluster, while those with higher QoS can be assigned to either E or P clusters. The latter behaviour can be modified dynamically by the taskpolicy command tool, or by the setpriority() function in code. These can constrain higher QoS threads to execution only on E cores, or on either E or P cores. However, they cannot alter the rule that lowest QoS threads are only executed on the E cluster.
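As a rough sketch of the programmatic route, assuming that the PRIO_DARWIN_PROCESS and PRIO_DARWIN_BG constants from <sys/resource.h> are the appropriate ones here (this isn’t an Apple-documented recipe, just one way of calling setpriority()):

```swift
import Darwin

// Put the current process into the Darwin 'background' state, which should
// confine its threads to the E cores (assumption: this is the same state
// that the taskpolicy tool manipulates; constants are from <sys/resource.h>).
if setpriority(PRIO_DARWIN_PROCESS, 0, PRIO_DARWIN_BG) != 0 {
    perror("setpriority")
}

// ... run work that should stay on the E cluster ...

// Lift the constraint again, allowing threads back onto the P clusters.
if setpriority(PRIO_DARWIN_PROCESS, 0, 0) != 0 {
    perror("setpriority")
}
```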
macOS itself adopts a strategy where most, if not all, of its background tasks are run at lowest QoS. These include automatic Time Machine backups and Spotlight index maintenance. This also applies to compression and decompression performed by Archive Utility: for example, if you download a copy of Xcode in xip format, decompressing that takes a long time as much of the code is constrained to the E cores, and there’s no way to change that.
Background threads
Lowest QoS threads are loaded and run differently in the original M1 and M1 Pro/Max chips, as they have different E cluster sizes.
In the original M1 chip, with 4 E cores, QoS 9 threads are run with the core frequency set at about 1000 MHz (1 GHz). What happens in the M1 Pro/Max with its 2 E cores is different: if there’s only one thread, it’s run on the cluster at a frequency of about 1000 MHz, but if there are two or more threads, the frequency is increased to 2064 MHz. This ensures that the E cluster in the M1 Pro/Max delivers at least the same performance for background tasks as that in the original M1, at similar power consumption, despite the difference in size of the clusters.
Common exceptions to this are lowest QoS threads of processes such as backupd, which also undergo I/O throttling, and are run at a frequency of about 1000 MHz on the M1 Pro/Max.
User threads
All threads with a QoS higher than 9 are handled similarly, with differences resulting from the priority given to their queues.
As high QoS threads are eligible to be run on either of the core types and any core cluster, their management differs between M1 and M1 Pro/Max variants. On the original M1, with its single P cluster, batches of up to 8 threads can be distributed to the two available clusters, with 4 thread slots available on each. When there are 4 or fewer threads, they will be run on the P cluster whenever possible, and the E cluster is only recruited when there are more high QoS threads in the queue. P cores are run at a frequency of about 3 GHz, and E cores at about 2 GHz, twice the frequency normally used for QoS 9 threads.
M1 Pro and Max chips have a total of three clusters, two of 4 P cores each, plus the half-size 2-core E cluster. With up to 4 threads in the queue, they will be allocated to the first P cluster (P0); threads 5-8 will go to the second P cluster (P1), which would otherwise remain unloaded and inactive for economy. If there are a further 2 threads in the queue, they will be run on the E cores. Frequencies are set to the maximum for the core type: 3228 MHz on P0 and P1, and 2064 MHz on E0.
M1 Ultra chips have a total of five clusters, each with 4 cores. They follow the same policy as M1 Pro/Max chips, but with all 4 P clusters being loaded before E0 is used.
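One way to see these allocation patterns for yourself is to throw a controlled number of CPU-bound threads at the scheduler and watch the CPU History window. A minimal sketch, assuming nothing beyond GCD and an arbitrary floating-point loop as the load:

```swift
import Dispatch

let threadCount = 8          // try 4, 8 or 10 to see successive clusters recruited
let group = DispatchGroup()

for _ in 0..<threadCount {
    // High QoS, so threads are allocated to P clusters first, spilling onto the E cluster.
    DispatchQueue.global(qos: .userInitiated).async(group: group) {
        var total = 0.0
        for i in 1...2_000_000_000 {
            total += Double(i).squareRoot()
        }
        print(total)         // keep the result live so the loop isn't optimised away
    }
}

group.wait()
```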
There are two situations in which code appears to run exclusively on a single core, though. The first is during the boot process: before the kernel initialises and runs the other cores, code runs on just a single active E core. The other is when ‘preparing’ a downloaded macOS update before starting the installation process. On M1 Pro/Max chips, the five threads involved are together given one core’s worth of active residency, indicated as 100% CPU, but are confined to a single P core, the first in the first of the 2 P clusters (P0, labelled below as Core 3).
This unusual distribution of active residency is sustained throughout the 30 minutes of preparation to install the update.
Patterns under load
The effects of macOS policies are shown in the following more typical examples taken from the CPU History window of Activity Monitor.
This original M1 chip is here being subjected to a series of loads from increasing numbers of CPU-intensive threads. Its 2 clusters, E0 and P0, are distinguished by the blue boxes. With 1-4 threads at high QoS (from the left), the load is borne entirely in the P0 cluster, then with 5-8 threads the E0 cluster takes its share.
This M1 Pro chip is under heavy and changing load from many threads, some of which are at background QoS, while others are at higher QoS. While much of the load is borne by the 2 cores in the E0 cluster, P0 is also loaded for much of the time, and P1 is recruited to take some of the peak.
I have rearranged the cores shown in this example from an M1 Ultra to separate them into their clusters, with E0 at the top, and P0 to P3 in two columns below. Loads shown here are typical of those during the first few minutes after login, with heavy load on E0 and P0, which spills over to P1-3 during the early peak.
One important piece of information about M1 cores not (yet) provided by Activity Monitor is cluster frequency. A cluster running at 100% CPU (equivalent to active residency) with a frequency of less than 1000 MHz is completing instructions at less than half the rate of the same cluster at 100% CPU and a frequency of 2064 MHz. Unfortunately, the only accessible means of obtaining frequency information at present is the command tool powermetrics.
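For example, something along these lines takes a single one-second sample of the CPU and reports per-cluster frequencies and active residencies (the exact output format varies between macOS versions):

```sh
sudo powermetrics --samplers cpu_power -i 1000 -n 1
```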
A summary of macOS management of CPU cores in the original M1, M1 Pro and Max chips is given in the diagram below. As I complete information about the M1 Ultra, I will incorporate that in the next revision. If you have an M1 Ultra, are familiar with powermetrics, and would like to help, I’d be delighted to work with you.
With Apple expected to announce the successor to its M1 series at the next WWDC in early June, it will be interesting to see its core architecture and the strategies offered by macOS for managing it.
I’m very grateful to Walt for providing information about and the screenshot of the Ultra under load.