Scheduling of Processes on M1 Series Chips: first draft

As I promised in yesterday’s article about how macOS 12 manages many processes on the cores in M1 series chips, here is my first draft account, complete with an initial flowchart of sorts. For the observations and evidence, and fuller details of how I’ve arrived at this, please refer to that previous article, and the several links it contains.

The chips

In the context of CPU cores, there are currently two variants of the M1 chip: the original version, which shipped in 2020, and that used in the M1 Pro and Max models which shipped in late 2021.

The original M1 chip has two core clusters, each containing four cores. One cluster contains Efficiency (E) cores with a maximum frequency of 2064 MHz and about half the internal processing units of the Performance (P) cores in the other cluster. P cores also have a higher maximum frequency, of 3204 MHz in the M1 and 3228 MHz in the M1 Pro/Max.

In contrast, the M1 Pro/Max has three core clusters: one containing just two E cores, the others containing four P cores each. Cores are managed, and perform, as clusters rather than individually. For instance, when you load four high-priority processes onto an M1 Pro/Max chip, they will be run in the first P cluster, and whenever possible the second P cluster will remain unloaded and inactive. Frequency is also set per cluster, and shouldn’t differ between cores within any given cluster.

Queues

This all starts with the creation of an Operation or similar, with an assigned Quality of Service (QoS), which determines how macOS will schedule it. Those processes with the lowest QoS of 9 are deemed ‘background’ processes, and will be run exclusively on the E cores; those with higher QoS, up to the maximum of 33, are deemed ‘user’ processes, and are eligible to be run on either P or E cores, according to their availability. Although I don’t (yet) have any direct evidence, I suspect that the queues of processes for these two QoS types are maintained separately. I also suspect that the other two intermediate QoS values are handled here as QoS 33.
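That rule can be put as a minimal Python sketch. It is a model of the behaviour described here, not Apple’s code. The names QoS, eligible_core_types and queue_for are invented for illustration. The intermediate values 17 and 25 are the raw values of Apple’s utility and userInitiated QoS classes; treating them like QoS 33, and keeping two separate queues, are only the suspicions stated above.

```python
from enum import IntEnum


class QoS(IntEnum):
    """Raw QoS values; 9 and 33 are the two extremes discussed above."""
    BACKGROUND = 9
    UTILITY = 17          # intermediate value, assumed to behave like 33
    USER_INITIATED = 25   # intermediate value, assumed to behave like 33
    USER_INTERACTIVE = 33


def eligible_core_types(qos):
    """Return the set of core types a process with this QoS may run on."""
    if qos == QoS.BACKGROUND:
        return {"E"}       # background: confined to the E cores
    return {"P", "E"}      # user: P cores preferred, E cores also eligible


def queue_for(qos):
    """Name the (suspected) separate queue a process would wait in."""
    return "background" if qos == QoS.BACKGROUND else "user"
```

For example, eligible_core_types(QoS.BACKGROUND) gives only the E cores, while any of the three higher values gives both core types.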

When a process slot becomes free on one of the designated types of core for a queue, macOS assigns the next process in that queue to the slot. Much of the time I have been observing M1 chips which are nearly idle, thus with all their slots available. In those circumstances, when there are multiple processes in the queue, macOS will allocate them to clusters in batches. For example, if the E cluster, consisting of four E cores in the original M1 chip, is almost idle, and there are ten processes in the low QoS queue, macOS will assign the first four of those processes to that cluster.
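The batching in that example can be sketched as follows. This is only an illustration: the function assign_batch and the process names are hypothetical, and it assumes the queue is served strictly first in, first out, which I haven’t confirmed.

```python
from collections import deque


def assign_batch(queue, free_slots):
    """Take up to free_slots processes from the front of the queue."""
    batch = []
    while queue and len(batch) < free_slots:
        batch.append(queue.popleft())
    return batch


# Ten low QoS processes waiting, and an idle four-core E cluster.
low_qos_queue = deque(f"proc{i}" for i in range(1, 11))
first_batch = assign_batch(low_qos_queue, free_slots=4)

print(first_batch)         # ['proc1', 'proc2', 'proc3', 'proc4']
print(len(low_qos_queue))  # 6 processes still waiting
```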

Background processes

Low QoS processes are loaded and run differently in original M1 and M1 Pro/Max chips, as they have different E cluster sizes.

In the original M1 chip, with four E cores, QoS 9 processes are run with the core frequency set at about 1000 MHz (1 GHz). What happens in the M1 Pro/Max with its two E cores is different: if there’s only one process, it’s run on the cluster at a frequency of about 1000 MHz, but if there are two processes, the frequency is increased to about 2000 MHz. This ensures that the E cluster in the M1 Pro/Max delivers at least the same performance for background tasks as that in the original M1, at similar power consumption, despite the difference in size of the clusters.
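A rough way to check that arithmetic is to multiply cores by frequency for a fully loaded E cluster. The sketch below does that using the approximate frequencies given above. The function names are invented, and "core-MHz" is a crude stand-in for throughput which assumes each process scales linearly with frequency.

```python
# Number of E cores in each variant, as described in this article.
E_CORES = {"M1": 4, "M1 Pro/Max": 2}


def background_frequency_mhz(chip, n_processes):
    """Approximate E cluster frequency when running QoS 9 processes."""
    if chip not in E_CORES:
        raise ValueError(f"unknown chip: {chip}")
    if n_processes < 1:
        raise ValueError("need at least one process")
    if chip == "M1 Pro/Max" and n_processes >= 2:
        return 2000  # two-core cluster is run at twice the frequency
    return 1000


def full_cluster_core_mhz(chip):
    """Cores multiplied by frequency, with every E core occupied."""
    cores = E_CORES[chip]
    return cores * background_frequency_mhz(chip, cores)


print(full_cluster_core_mhz("M1"))          # 4000
print(full_cluster_core_mhz("M1 Pro/Max"))  # 4000
```

On that crude measure, the two E cores of the M1 Pro/Max at about 2000 MHz match the four E cores of the original M1 at about 1000 MHz.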

User processes

All processes with a QoS higher than 9 appear to be handled similarly at present, although further work is needed to investigate that properly.

As high QoS processes are eligible to be run on either of the core types and any core cluster, their management differs between M1 and M1 Pro/Max variants. On the original M1, with its single P cluster, batches of up to eight processes can be distributed to the two available clusters, with four process slots available on each. When there are four or fewer processes, they will be run on the P cluster whenever possible, and the E cluster is only recruited when there are more high QoS processes in the queue. P cores are run at a frequency of about 3 GHz, and E cores at about 2 GHz, twice the frequency normally used for QoS 9 processes.

M1 Pro and Max chips have a total of three clusters, two of four P cores each, plus the half-size two-core E cluster. With up to four processes in the queue, they will be allocated to the first P cluster (P0); processes 5-8 will go to the second P cluster (P1), which would otherwise remain unloaded and inactive for economy. If there are a further two processes in the queue, they will be run on the E cores. Frequencies are set to the maximum for the core type: 3228 MHz on P0 and P1, and 2064 MHz on the E cluster.
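The allocation order described in the last two paragraphs can be summarised in a short Python model. It assumes an otherwise idle chip with no competing processes, and the names CLUSTERS and allocate_user_processes are invented for illustration; this is a sketch of the observed behaviour, not of how macOS implements it.

```python
# Clusters in the order they are filled with high QoS processes,
# each with its number of process slots (one per core).
CLUSTERS = {
    "M1": [("P", 4), ("E", 4)],
    "M1 Pro/Max": [("P0", 4), ("P1", 4), ("E", 2)],
}


def allocate_user_processes(chip, n_processes):
    """Fill clusters in order; anything left over stays in the queue."""
    remaining = n_processes
    allocation = {}
    for name, slots in CLUSTERS[chip]:
        used = min(remaining, slots)
        allocation[name] = used
        remaining -= used
    allocation["queued"] = remaining
    return allocation


print(allocate_user_processes("M1", 6))
# {'P': 4, 'E': 2, 'queued': 0}
print(allocate_user_processes("M1 Pro/Max", 6))
# {'P0': 4, 'P1': 2, 'E': 0, 'queued': 0}
```

With six processes, the original M1 has to recruit its E cluster, while the M1 Pro/Max spills into P1 and leaves its E cores free.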

Here’s a tear-out PDF to take away: schedulingprocessesm1

Contention

The greatest limitation of my testing to date is that I haven’t observed how contention from other parent processes and across different QoS might affect these behaviours. For instance, if one app is trying to run processes with a high QoS and another is trying to run processes with a low QoS, does macOS then reserve the E cluster for the latter, and limit the former to the P cluster(s)? Are processes with the highest QoS of 33 given priority over those with either of the two intermediate QoS values?

I will be looking at these in the coming weeks, and reporting back. In the meantime, if you’re aware of any other evidence which confirms or contradicts any of the above, please let me know, preferably by comment below (or email). I value your thoughts and results.

7 Comments


1

Bob on January 13, 2022 at 2:20 pm

Hi, the macOS kernel’s open source code may be helpful – although Apple haven’t yet updated their public repo with Monterey, the Big Sur code has been published. Start with the scheduler documentation: https://github.com/apple/darwin-xnu/blob/main/osfmk/kern/sched_clutch.md

(There’s a newer version published here, though it isn’t Monterey: https://opensource.apple.com/source/xnu/xnu-7195.81.3/osfmk/kern )

- 2
  
  hoakley on January 13, 2022 at 8:08 pm
  
  Thank you.
  Unfortunately, the most important information is in the Monterey source, as that’s required to support the M1 Pro/Max chips. From what I’ve seen in the macOS 11 source, there’s more to it than just the kernel, whose source doesn’t explain how the assigned QoS is converted into the internal QoS value, nor how QoS 9 processes end up confined to the E cores.
  Sometimes trying to work out how software behaves from its source is more laborious and may even mislead.
  Another useful source is Apple’s relevant patent filings, although they too aren’t easy to understand, and often don’t match observations, or differ considerably in detail.
  That said, the Monterey source could make very relevant reading.
  Howard.
  
3

Dave on January 13, 2022 at 3:02 pm

Interesting, very interesting, thank you.

I am wondering what the caching hierarchy design is for the M1 Pro/Max. Also, is there any way to assign a child process to a specific core or cluster, to potentially reduce L0 cache thrashing on multiple, independent, memory-intensive child processes?
I am dealing with that exact question on an 8-core Intel processor, each core with a hyperthread.
16 child processes was a performance disaster, 2 was the performance winner – by a long shot.

- 4
  
  hoakley on January 13, 2022 at 8:11 pm
  
  Thank you. I think the caching has been explored by others far more knowledgeable than I. My tests are designed so that there’s no dependency on caching at all.
  No, as far as I’m aware, macOS gives no access to user assignment of processes to either cores or clusters. The only control is via QoS, which determines which types of cores are eligible, as explained above.
  If you want that level of control, I believe that Asahi Linux should be able to oblige.
  Howard.
  
5

Tony on January 14, 2022 at 6:28 pm

This is a fascinating area so thank you for the work you have done. Since I expect you will expand this as time goes on, I would like to question some of the terminology used, particularly processes and threads.

I believe from your diagram that you have characterised Dispatch Queues (in GCD) with sequential execution. GCD uses a pool of threads to execute the work loaded onto its queues and, AIUI, remains within a single process.

This has two consequences. One is that you are studying threads, not processes; that’s probably just a terminology issue (though the difference between threads and processes is quite fundamental). The other consequence is that you are seeing both the macOS scheduler’s algorithms and GCD’s algorithms. Again, GCD’s contribution here is probably minimal (though Apple does say “GCD, operating at the system level, can better accommodate the needs of all running applications, matching them to the available system resources in a balanced fashion”*).

As well as the terminology issue, I raise this since it would be interesting to know if there are significant differences using threads created long-hand (eg with pthread) or with actual processes (eg from fork). These would be concurrent rather than serial, of course. Using separate processes, though not usually the right approach for an app, would presumably study the ‘pure’ macOS scheduler, having removed GCD from the equation: that may be beyond your intended scope.

* From: https://developer.apple.com/documentation/dispatch?language=objc

- 6
  
  hoakley on January 14, 2022 at 11:08 pm
  
  Thank you, Tony.
  You are, of course, entirely correct: these are within-process threads, although I plead the fact that Apple’s documentation doesn’t in this case even mention GCD.
  Yes, this does look at all the macOS scheduling mechanisms, and that’s intentional. I don’t know of any way of assigning full processes, like the threads in an app, to specific core types. QoS doesn’t appear to apply to them, for example in their environment. But what the developer needs to know is how they can guide macOS to schedule the threads in their app appropriately.
  I will push out a corrected account next week, with processes replaced throughout by threads, and a little explanation as to how this fits into the grand scheme.
  Howard.
  
  - 7
    
    Tony on January 18, 2022 at 6:08 pm
    
    Thanks. I look forward to reading more.
    
