In conventional multi-core processors, like the Intel CPUs used in previous Mac models, all cores are the same. Allocating threads to cores is therefore a matter of balancing their load, in what’s termed symmetric multiprocessing (SMP).
In Activity Monitor’s CPU History window, core load (as CPU %) is shown against time, with the oldest values at the left. Odd-numbered cores in the left half are real, and show the eight cores in the 8-Core Intel Xeon W under heavy load. Even-numbered cores in the right half are the virtual cores of Hyper-Threading, engaged to cope with the heaviest load.
CPUs in Apple Silicon chips are different, as they contain two different core types, one designed for high performance (Performance, P or Firestorm cores), the other for energy efficiency (Efficiency, E or Icestorm cores). For these to work well, threads need to be allocated by core type, a task which can be left to apps and processes, as it is in Asahi Linux, or managed by the operating system, as it is in macOS. This article explains how macOS manages core allocation in all Apple’s M1 series chips, in what it terms asymmetric multiprocessing (AMP, although others prefer to call this heterogeneous computing).
Architecture
There are two types of CPU core in M1 series chips:
- E cores contain roughly half the internal processing units of P cores, and have a maximum frequency of 2064 MHz.
- P cores have a higher maximum frequency, of either 3204 MHz in the original M1, or 3228 MHz in M1 Pro/Max/Ultra.
There are three configurations of CPU cores available in M1 series chips:
- the original M1, with 4 E and 4 P cores, in the MacBook Air, MacBook Pro 13-inch, iMac and Mac mini;
- M1 Pro and Max, with 2 E and 8 P cores, in the MacBook Pro 14- and 16-inch, and Mac Studio Max;
- M1 Ultra, with 4 E and 16 P cores, in the Mac Studio Ultra.
Some MacBook Pro 14-inch notebooks have a reduced M1 Pro chip with only 6 P cores instead of 8.
To simplify the management of cores, macOS divides them functionally into clusters of 2-4 cores of the same type. Unfortunately, the numbering of cores at system level, as shown by tools such as powermetrics, differs from that displayed in Activity Monitor. For consistency with the latter, I here follow its core numbering, but number clusters in accordance with the system. The three chips have the following functional clusters as of macOS Monterey 12.3.1:
- the original M1 has one cluster of each type of core, E0 and P0, each containing 4 cores of the same type;
- M1 Pro and Max have one cluster of 2 E cores (E0), and two clusters each containing 4 P cores (P0, P1);
- M1 Ultra has one cluster of 4 E cores (E0), and four clusters each containing 4 P cores (P0, P1, P2, P3).
All cores within any given cluster are run at the same frequency, and generally (but not always) have their load balanced within the cluster. There are occasions when load is distributed more unevenly, and in exceptional cases, certain threads may be allocated to only one core within a cluster.
Thread control
Unlike Asahi Linux, macOS doesn’t provide direct access to cores, core types, or clusters, at least not in public APIs. Instead, these are normally managed through Grand Central Dispatch using Quality of Service (QoS) settings, which macOS then uses to determine thread management policies.
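As a minimal sketch of how an app can express this, assuming nothing beyond Grand Central Dispatch and a made-up busy-loop function to make the threads visible in Activity Monitor:

```swift
import Dispatch

// Purely illustrative CPU load, so each thread shows up in the CPU History window.
func burnCPU(_ iterations: Int) -> Double {
    var total = 0.0
    for i in 1...iterations {
        total += Double(i).squareRoot()
    }
    return total
}

// Lowest QoS (Background, QoS 9): macOS will run this thread on the E cluster only.
DispatchQueue.global(qos: .background).async {
    print(burnCPU(500_000_000))
}

// Higher QoS (User Initiated): eligible for either core type, P clusters preferred.
DispatchQueue.global(qos: .userInitiated).async {
    print(burnCPU(500_000_000))
}

// Park the main thread so this command-line demo keeps running.
dispatchMain()
```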
Threads with the lowest QoS will only be run on the E cluster, while those with higher QoS can be assigned to either E or P clusters. The latter behaviour can be modified dynamically by the taskpolicy command tool, or by the setpriority() function in code. These can constrain higher QoS threads to execution only on E cores, or on either E or P cores. However, they cannot alter the rule that lowest QoS threads are only executed on the E cluster.
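As a rough sketch of the programmatic route, assuming that the PRIO_DARWIN_PROCESS and PRIO_DARWIN_BG constants from <sys/resource.h> are the appropriate ones here (this isn’t an Apple-documented recipe, just one way of calling setpriority()):

```swift
import Darwin

// Put the current process into the Darwin 'background' state, which should
// confine its threads to the E cores (assumption: this is the same state
// that the taskpolicy tool manipulates; constants are from <sys/resource.h>).
if setpriority(PRIO_DARWIN_PROCESS, 0, PRIO_DARWIN_BG) != 0 {
    perror("setpriority")
}

// ... run work that should stay on the E cluster ...

// Lift the constraint again, allowing threads back onto the P clusters.
if setpriority(PRIO_DARWIN_PROCESS, 0, 0) != 0 {
    perror("setpriority")
}
```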
macOS itself adopts a strategy where most, if not all, of its background tasks are run at lowest QoS. These include automatic Time Machine backups and Spotlight index maintenance. This also applies to compression and decompression performed by Archive Utility: for example, if you download a copy of Xcode in xip format, decompressing that takes a long time as much of the code is constrained to the E cores, and there’s no way to change that.
Background threads
Lowest QoS threads are loaded and run differently in the original M1 and M1 Pro/Max chips, as they have different E cluster sizes.
In the original M1 chip, with 4 E cores, QoS 9 threads are run with the core frequency set at about 1000 MHz (1 GHz). What happens in the M1 Pro/Max with its 2 E cores is different: if there’s only one thread, it’s run on the cluster at a frequency of about 1000 MHz, but if there are two or more threads, the frequency is increased to 2064 MHz. This ensures that the E cluster in the M1 Pro/Max delivers at least the same performance for background tasks as that in the original M1, at similar power consumption, despite the difference in size of the clusters.
Common exceptions to this are lowest QoS threads of processes such as backupd, which also undergo I/O throttling, and are run at a frequency of about 1000 MHz on the M1 Pro/Max.
User threads
All threads with a QoS higher than 9 are handled similarly, with differences resulting from the priority given to their queues.
As high QoS threads are eligible to be run on either of the core types and any core cluster, their management differs between M1 and M1 Pro/Max variants. On the original M1, with its single P cluster, batches of up to 8 threads can be distributed to the two available clusters, with 4 thread slots available on each. When there are 4 or fewer threads, they will be run on the P cluster whenever possible, and the E cluster is only recruited when there are more high QoS threads in the queue. P cores are run at a frequency of about 3 GHz, and E cores at about 2 GHz, twice the frequency normally used for QoS 9 threads.
M1 Pro and Max chips have a total of three clusters, two of 4 P cores each, plus the half-size 2-core E cluster. With up to 4 threads in the queue, they will be allocated to the first P cluster (P0); threads 5-8 will go to the second P cluster (P1), which would otherwise remain unloaded and inactive for economy. If there are a further 2 threads in the queue, they will be run on the E cores. Frequencies are set to the maximum for the core type: 3228 MHz on P0 and P1, and 2064 MHz on E0.
M1 Ultra chips have a total of five clusters, each with 4 cores. They follow the same policy as M1 Pro/Max chips, but with all 4 P clusters being loaded before E0 is used.
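One way to see these allocation patterns for yourself is to throw a controlled number of CPU-bound threads at the scheduler and watch the CPU History window. A minimal sketch, assuming nothing beyond GCD and an arbitrary floating-point loop as the load:

```swift
import Dispatch

let threadCount = 8          // try 4, 8 or 10 to see successive clusters recruited
let group = DispatchGroup()

for _ in 0..<threadCount {
    // High QoS, so threads are allocated to P clusters first, spilling onto the E cluster.
    DispatchQueue.global(qos: .userInitiated).async(group: group) {
        var total = 0.0
        for i in 1...2_000_000_000 {
            total += Double(i).squareRoot()
        }
        print(total)         // keep the result live so the loop isn't optimised away
    }
}

group.wait()
```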
There are two situations in which code appears to run exclusively on a single core, though. The first is during the boot process: before the kernel initialises and runs the other cores, code runs on just a single active E core. The other is when ‘preparing’ a downloaded macOS update before starting the installation process. On M1 Pro/Max chips, the five threads involved are together given one core’s worth of active residency, indicated as 100% CPU, but are confined to a single P core, the first in the first of the 2 P clusters (P0, labelled below as Core 3).
This unusual distribution of active residency is sustained throughout the 30 minutes of preparation to install the update.
Patterns under load
The effects of macOS policies are shown in the following more typical examples taken from the CPU History window of Activity Monitor.
This original M1 chip is here being subjected to a series of loads from increasing numbers of CPU-intensive threads. Its 2 clusters, E0 and P0, are distinguished by the blue boxes. With 1-4 threads at high QoS (from the left), the load is borne entirely in the P0 cluster, then with 5-8 threads the E0 cluster takes its share.
This M1 Pro chip is under heavy and changing load from many threads, some of which are at background QoS, while others are at higher QoS. While much of the load is borne by the 2 cores in the E0 cluster, P0 is also loaded for much of the time, and P1 is recruited to take some of the peak.
I have rearranged the cores shown in this example from an M1 Ultra to separate them into their clusters, with E0 at the top, and P0 to P3 in two columns below. Loads shown here are typical of those during the first few minutes after login, with heavy load on E0 and P0, which spills over to P1-3 during the early peak.
One important piece of information about M1 cores not (yet) provided by Activity Monitor is cluster frequency. A cluster running at 100% CPU (equivalent to active residency) with a frequency of less than 1000 MHz is completing instructions at less than half the rate of the same cluster at 100% CPU and a frequency of 2064 MHz. Unfortunately, the only accessible means of obtaining frequency information at present is the command tool powermetrics.
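For example, something along these lines takes a single one-second sample of the CPU and reports per-cluster frequencies and active residencies (the exact output format varies between macOS versions):

```sh
sudo powermetrics --samplers cpu_power -i 1000 -n 1
```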
A summary of macOS management of CPU cores in the original M1, M1 Pro and Max chips is given in the diagram below. As I complete information about the M1 Ultra, I will incorporate that in the next revision. If you have an M1 Ultra, are familiar with powermetrics, and would like to help, I’d be delighted to work with you.
With Apple expected to announce the successor to its M1 series at the next WWDC in early June, it will be interesting to see its core architecture and the strategies offered by macOS for managing it.
I’m very grateful to Walt for providing information about and the screenshot of the Ultra under load.