Scheduling of Threads on M1 Series Chips: second draft

Following further work on the behaviour of M1 series chips, and your comments on my first draft earlier this month, this article is an attempt to move closer to an understanding of how macOS schedules threads on them. For the observations and evidence, and fuller details of how I’ve arrived at this, please refer to this article, the several links it contains, and yesterday’s sequel.

The chips

In the context of CPU cores, there are currently two variants of the M1 chip: the original version, which shipped in 2020, and that used in the M1 Pro and Max models which shipped in late 2021.

The original M1 chip has two core clusters, each containing four cores. One cluster contains Efficiency (E) cores with a maximum frequency of 2064 MHz and about half the internal processing units of the Performance (P) cores in the other cluster. P cores also have a higher maximum frequency, of 3204 MHz in the M1 and 3228 MHz in the M1 Pro/Max.

In contrast, the M1 Pro/Max has three core clusters: one containing just two E cores, the others containing four P cores each. Cores are managed, and perform, as clusters: for instance, when you load four high-priority threads onto an M1 Pro/Max chip, they will be run in the first P cluster, and whenever possible the second P cluster will remain unloaded and inactive. Frequency is also set per cluster, and shouldn’t differ between cores within any given cluster.

Grand Central Dispatch

This all starts with the creation of a thread as an Operation or similar, which is added to a queue with an assigned Quality of Service (QoS). Threads with the lowest QoS of 9 are deemed ‘background’, and will be run exclusively on the E cores; those with higher QoS, up to the maximum of 33, are deemed ‘user’ threads, and are eligible to be run on either P or E cores, according to their availability.
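As a concrete illustration, here’s a minimal sketch in Swift of the two ends of that QoS range; the queue labels are purely illustrative. The numbers quoted above are the raw values of the Dispatch QoS classes: .background is QOS_CLASS_BACKGROUND (0x09, i.e. 9), and .userInteractive is QOS_CLASS_USER_INTERACTIVE (0x21, i.e. 33).

```swift
import Dispatch

// Lowest QoS: these work items are only eligible for the E cores.
let backgroundQueue = DispatchQueue(label: "co.example.housekeeping",
                                    qos: .background,
                                    attributes: .concurrent)

// Highest QoS: these work items are eligible for either P or E cores.
let interactiveQueue = DispatchQueue(label: "co.example.interactive",
                                     qos: .userInteractive,
                                     attributes: .concurrent)

backgroundQueue.async {
    // 'background' work, confined to the E cluster
}
interactiveQueue.async {
    // 'user' work, run on the P cores when they're available
}
```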

Threads are dispatched from queues in first in, first out order, their priority set by the QoS of the queue (and thus of the threads in it). When there are threads ready for execution in a queue with the highest QoS, they will be dispatched before those waiting in queues with lower QoS. When clusters are essentially idle, threads are dispatched in batches to fill the vacant slots in each cluster. For example, if the E cluster, consisting of four E cores in the original M1 chip, is almost idle, and there are ten threads in the low QoS queue, macOS will assign the first four of those threads to that cluster.
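To make that example concrete, here’s a minimal sketch of loading ten work items onto a lowest-QoS queue; on an original M1, the first four should be dispatched to the idle E cluster, with the remaining six following as its cores become free. The queue label is illustrative.

```swift
import Dispatch

let lowQoSQueue = DispatchQueue(label: "co.example.lowQoS",
                                qos: .background,
                                attributes: .concurrent)

// Ten 'background' threads queued at once; macOS dispatches them in
// batches to fill the vacant slots in the E cluster.
for _ in 0..<10 {
    lowQoSQueue.async {
        // one unit of background work
    }
}
```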

Cluster assignment

Threads with the lowest QoS will only be run on the E cluster, while those with higher QoS can be assigned to either E or P clusters. The latter behaviour can be modified dynamically using the taskpolicy command tool, or the setpriority() function in code. Those can constrain higher QoS threads to execution only on E cores, or restore their eligibility to run on either E or P cores. However, they cannot alter the rule that lowest QoS threads are only executed on the E cluster.
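For the setpriority() route, here’s a minimal sketch; the function names are mine, and the two constants are the values of PRIO_DARWIN_PROCESS and PRIO_DARWIN_BG from <sys/resource.h>, defined locally in case those macros aren’t imported into Swift.

```swift
import Darwin

// Values from <sys/resource.h>
let prioDarwinProcess: Int32 = 4      // PRIO_DARWIN_PROCESS
let prioDarwinBG: Int32 = 0x1000      // PRIO_DARWIN_BG

// Confine a process's threads to the E cores, the equivalent of
// `taskpolicy -b -p <pid>`.
func demoteToEfficiencyCores(pid: pid_t) -> Bool {
    return setpriority(prioDarwinProcess, id_t(pid), prioDarwinBG) == 0
}

// Remove that constraint again, the equivalent of `taskpolicy -B -p <pid>`.
func restoreDefaultPolicy(pid: pid_t) -> Bool {
    return setpriority(prioDarwinProcess, id_t(pid), 0) == 0
}
```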

Background threads

Lowest QoS threads are loaded and run differently on the original M1 and the M1 Pro/Max, as their E clusters differ in size.

In the original M1 chip, with four E cores, QoS 9 threads are run with the core frequency set at about 1000 MHz (1 GHz). What happens in the M1 Pro/Max with its two E cores is different: if there’s only one thread, it’s run on the cluster at a frequency of about 1000 MHz, but if there are two threads, the frequency is increased to 2000 MHz. This ensures that the E cluster in the M1 Pro/Max delivers at least the same performance for background tasks as that in the original M1, at similar power consumption, despite the difference in size of the clusters.

The common exceptions to this are lowest QoS threads of processes such as backupd, which also undergo I/O throttling, and are run at a frequency of about 1000 MHz on the M1 Pro/Max.
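As an aside, a process can opt its own disk I/O into throttling of that kind using setiopolicy_np(); this is only a sketch, with the constant values taken from <sys/resource.h> and defined locally in case the macros aren’t imported into Swift.

```swift
import Darwin

// Values from <sys/resource.h>
let ioPolTypeDisk: Int32 = 0          // IOPOL_TYPE_DISK
let ioPolScopeProcess: Int32 = 0      // IOPOL_SCOPE_PROCESS
let ioPolThrottle: Int32 = 3          // IOPOL_THROTTLE

// Ask for the calling process's disk I/O to be throttled, so that it
// yields to normal-priority I/O from other processes.
func throttleOwnDiskIO() -> Bool {
    return setiopolicy_np(ioPolTypeDisk, ioPolScopeProcess, ioPolThrottle) == 0
}
```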

User threads

All threads with a QoS higher than 9 appear to be handled similarly at present, with differences resulting from the priority given to their queues.

As high QoS threads are eligible to be run on either of the core types and any core cluster, their management differs between M1 and M1 Pro/Max variants. On the original M1, with its single P cluster, batches of up to eight threads can be distributed to the two available clusters, with four thread slots available on each. When there are four or fewer threads, they will be run on the P cluster whenever possible, and the E cluster is only recruited when there are more high QoS threads in the queue. P cores are run at a frequency of about 3 GHz, and E cores at about 2 GHz, twice the frequency normally used for QoS 9 threads.

M1 Pro and Max chips have a total of three clusters, two of four P cores each, plus the half-size two-core E cluster. With up to four threads in the queue, they will be allocated to the first P cluster (P0); threads 5-8 will go to the second P cluster (P1), which would otherwise remain unloaded and inactive for economy. If there are a further two threads in the queue, they will be run on the E cores. Frequencies are set to the maximum for the core type: 3228 MHz on P0 and P1, and 2064 MHz on the E cluster.
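Here’s a minimal sketch of loading the chip with high-QoS threads in the way described above; the busy loop is purely illustrative work, and the function name is mine.

```swift
import Foundation

// Run `threadCount` compute-bound work items at 'user' QoS, then call
// `completion` once they have all finished.
func loadClusters(threadCount: Int, completion: @escaping () -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        // concurrentPerform inherits the QoS of the submitting thread, so
        // these iterations run as 'user' threads eligible for P or E cores.
        DispatchQueue.concurrentPerform(iterations: threadCount) { _ in
            var x = 0.0
            for i in 1...10_000_000 { x += sqrt(Double(i)) }
            _ = x   // result discarded; the loop only occupies a core
        }
        completion()
    }
}

// On an M1 Pro/Max, ten threads should fill P0 and P1 and recruit both E cores:
// loadClusters(threadCount: 10) { print("done") }
```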

[Diagram: schedulingThreadsM1]

Here’s a tear-out PDF to take away: schedulingThreadsM1

Relevance to programmers

When creating threads for GCD, it’s now important if not essential to set the QoS for each queue. While this makes relatively little difference when running on Intel Macs, on M1 Macs it determines both the dispatch of threads and the cluster(s) those threads can run on. Choosing the appropriate QoS merits careful consideration, and giving the user flexibility is well worthwhile.

This is particularly true of threads which might benefit from being run only on the E cores. While that could be ideal in many use cases, consider giving the user an option in the app’s preferences, as the user can’t promote threads assigned the background QoS so that they can also run on P cores; a sketch of such an option follows below. Designing to get the best out of QoS and the M1’s different cores could make a big difference to the user experience.
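One way of offering that option is sketched here, assuming a hypothetical preference key named runInBackground in the app’s defaults.

```swift
import Foundation

// Return the queue on which the app should run its longer tasks, according
// to the user's preference: .background confines them to the E cores, while
// .userInitiated leaves them eligible for the P cores as well.
func queueForLongTasks() -> DispatchQueue {
    let economical = UserDefaults.standard.bool(forKey: "runInBackground")
    return DispatchQueue(label: "co.example.longTasks",
                         qos: economical ? .background : .userInitiated,
                         attributes: .concurrent)
}
```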

Relevance to users

Most of the time, it’s great that the many background services in macOS run exclusively on the E cores. However, when you do want to accelerate them, perhaps to complete a large Time Machine backup, there aren’t any tools which let those services use your M1 Mac’s P cores to get the job done any quicker.

If you have an app or tool which currently uses substantial amounts of P core time for work which you’d prefer run in the background on the E cores, there are two effective choices:

  • St. Clair Software’s App Tamer has an experimental option which will run such processes on the E cores when the app is in the background.
  • The command taskpolicy -b -p 567, where the last number is the PID of the process, will achieve the same demotion until reversed using taskpolicy -B -p 567.

Some apps do now give the user a choice as to how to run longer tasks. Check the documentation, and ask the developer when they intend to introduce this feature if it isn’t already available.

I’m particularly grateful to Tony, and others who commented on my first draft.