Scheduling of Threads on M1 Series Chips: second draft

Following further work on the behaviour of M1 series chips, and your comments on my first draft earlier this month, this article is an attempt to move closer to an understanding of how macOS schedules threads on them. For the observations and evidence, and fuller details of how I’ve arrived at this, please refer to this article, the several links it contains, and yesterday’s sequel.

The chips

In the context of CPU cores, there are currently two variants of the M1 chip: the original version, which shipped in 2020, and that used in the M1 Pro and Max models which shipped in late 2021.

The original M1 chip has two core clusters, each containing four cores. One cluster contains Efficiency (E) cores with a maximum frequency of 2064 MHz and about half the internal processing units of the Performance (P) cores in the other cluster. P cores also have a higher maximum frequency, of 3204 MHz in the M1 and 3228 MHz in the M1 Pro/Max.

In contrast, the M1 Pro/Max has three core clusters: one containing just two E cores, the others containing four P cores each. Cores are managed, and perform, as clusters: for instance, when you load four high-priority threads onto an M1 Pro/Max chip, they will be run in the first P cluster, and whenever possible the second P cluster will remain unloaded and inactive. Frequency is also set per cluster, and shouldn’t differ between cores within any given cluster.

Grand Central Dispatch

This all starts with the creation of a thread as an Operation or similar, which is added to a queue with an assigned Quality of Service (QoS). Threads with the lowest QoS of 9 are deemed ‘background’, and will be run exclusively on the E cores; those with higher QoS, up to the maximum of 33, are deemed ‘user’ threads, and are eligible to be run on either P or E cores, according to their availability.
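As a concrete illustration, here’s a minimal sketch in Swift of the two ends of that QoS range; the queue labels are purely illustrative. The numbers quoted above are the raw values of the Dispatch QoS classes: .background is QOS_CLASS_BACKGROUND (0x09, i.e. 9), and .userInteractive is QOS_CLASS_USER_INTERACTIVE (0x21, i.e. 33).

```swift
import Dispatch

// Lowest QoS: these work items are only eligible for the E cores.
let backgroundQueue = DispatchQueue(label: "co.example.housekeeping",
                                    qos: .background,
                                    attributes: .concurrent)

// Highest QoS: these work items are eligible for either P or E cores.
let interactiveQueue = DispatchQueue(label: "co.example.interactive",
                                     qos: .userInteractive,
                                     attributes: .concurrent)

backgroundQueue.async {
    // 'background' work, confined to the E cluster
}
interactiveQueue.async {
    // 'user' work, run on the P cores when they're available
}
```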

Threads are dispatched from queues in first in, first out order, their priority set by the QoS of the queue (and thus of the threads in it). When there are threads ready for execution in a queue with the highest QoS, they will be dispatched before those waiting in queues with lower QoS. When clusters are essentially idle, threads are dispatched in batches to fill the vacant slots in each cluster. For example, if the E cluster, consisting of four E cores in the original M1 chip, is almost idle, and there are ten threads in the low QoS queue, macOS will assign the first four of those threads to that cluster.
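To make that example concrete, here’s a minimal sketch of loading ten work items onto a lowest-QoS queue; on an original M1, the first four should be dispatched to the idle E cluster, with the remaining six following as its cores become free. The queue label is illustrative.

```swift
import Dispatch

let lowQoSQueue = DispatchQueue(label: "co.example.lowQoS",
                                qos: .background,
                                attributes: .concurrent)

// Ten 'background' threads queued at once; macOS dispatches them in
// batches to fill the vacant slots in the E cluster.
for _ in 0..<10 {
    lowQoSQueue.async {
        // one unit of background work
    }
}
```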

Cluster assignment

Threads with the lowest QoS will only be run on the E cluster, while those with higher QoS can be assigned to either E or P clusters. The latter behaviour can be modified dynamically using the taskpolicy command tool, or the setpriority() function in code. Those can constrain higher QoS threads to execution only on E cores, or restore their eligibility to run on either E or P cores. However, they cannot alter the rule that lowest QoS threads are only executed on the E cluster.
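For the setpriority() route, here’s a minimal sketch; the function names are mine, and the two constants are the values of PRIO_DARWIN_PROCESS and PRIO_DARWIN_BG from <sys/resource.h>, defined locally in case those macros aren’t imported into Swift.

```swift
import Darwin

// Values from <sys/resource.h>
let prioDarwinProcess: Int32 = 4      // PRIO_DARWIN_PROCESS
let prioDarwinBG: Int32 = 0x1000      // PRIO_DARWIN_BG

// Confine a process's threads to the E cores, the equivalent of
// `taskpolicy -b -p <pid>`.
func demoteToEfficiencyCores(pid: pid_t) -> Bool {
    return setpriority(prioDarwinProcess, id_t(pid), prioDarwinBG) == 0
}

// Remove that constraint again, the equivalent of `taskpolicy -B -p <pid>`.
func restoreDefaultPolicy(pid: pid_t) -> Bool {
    return setpriority(prioDarwinProcess, id_t(pid), 0) == 0
}
```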

Background threads

Lowest QoS threads are loaded and run differently on the original M1 and the M1 Pro/Max, as their E clusters differ in size.

In the original M1 chip, with four E cores, QoS 9 threads are run with the core frequency set at about 1000 MHz (1 GHz). What happens in the M1 Pro/Max with its two E cores is different: if there’s only one thread, it’s run on the cluster at a frequency of about 1000 MHz, but if there are two threads, the frequency is increased to 2000 MHz. This ensures that the E cluster in the M1 Pro/Max delivers at least the same performance for background tasks as that in the original M1, at similar power consumption, despite the difference in size of the clusters.

The common exceptions to this are lowest QoS threads of processes such as backupd, which also undergo I/O throttling, and are run at a frequency of about 1000 MHz on the M1 Pro/Max.
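As an aside, a process can opt its own disk I/O into throttling of that kind using setiopolicy_np(); this is only a sketch, with the constant values taken from <sys/resource.h> and defined locally in case the macros aren’t imported into Swift.

```swift
import Darwin

// Values from <sys/resource.h>
let ioPolTypeDisk: Int32 = 0          // IOPOL_TYPE_DISK
let ioPolScopeProcess: Int32 = 0      // IOPOL_SCOPE_PROCESS
let ioPolThrottle: Int32 = 3          // IOPOL_THROTTLE

// Ask for the calling process's disk I/O to be throttled, so that it
// yields to normal-priority I/O from other processes.
func throttleOwnDiskIO() -> Bool {
    return setiopolicy_np(ioPolTypeDisk, ioPolScopeProcess, ioPolThrottle) == 0
}
```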

User threads

All threads with a QoS higher than 9 appear to be handled similarly at present, with differences resulting from the priority given to their queues.

As high QoS threads are eligible to be run on either of the core types and any core cluster, their management differs between M1 and M1 Pro/Max variants. On the original M1, with its single P cluster, batches of up to eight threads can be distributed to the two available clusters, with four thread slots available on each. When there are four or fewer threads, they will be run on the P cluster whenever possible, and the E cluster is only recruited when there are more high QoS threads in the queue. P cores are run at a frequency of about 3 GHz, and E cores at about 2 GHz, twice the frequency normally used for QoS 9 threads.

M1 Pro and Max chips have a total of three clusters, two of four P cores each, plus the half-size two-core E cluster. With up to four threads in the queue, they will be allocated to the first P cluster (P0); threads 5-8 will go to the second P cluster (P1), which would otherwise remain unloaded and inactive for economy. If there are a further two threads in the queue, they will be run on the E cores. Frequencies are set to the maximum for the core type: 3228 MHz on P0 and P1, and 2064 MHz on the E cluster.
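Here’s a minimal sketch of loading the chip with high-QoS threads in the way described above; the busy loop is purely illustrative work, and the function name is mine.

```swift
import Foundation

// Run `threadCount` compute-bound work items at 'user' QoS, then call
// `completion` once they have all finished.
func loadClusters(threadCount: Int, completion: @escaping () -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        // concurrentPerform inherits the QoS of the submitting thread, so
        // these iterations run as 'user' threads eligible for P or E cores.
        DispatchQueue.concurrentPerform(iterations: threadCount) { _ in
            var x = 0.0
            for i in 1...10_000_000 { x += sqrt(Double(i)) }
            _ = x   // result discarded; the loop only occupies a core
        }
        completion()
    }
}

// On an M1 Pro/Max, ten threads should fill P0 and P1 and recruit both E cores:
// loadClusters(threadCount: 10) { print("done") }
```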

[Diagram: schedulingThreadsM1]

Here’s a tear-out PDF to take away: schedulingThreadsM1

Relevance to programmers

When creating threads for GCD, it’s now important if not essential to set the QoS for each queue. While this makes relatively little difference when running on Intel Macs, on M1 Macs it determines both the dispatch of threads and the cluster(s) those threads can run on. Choosing the appropriate QoS merits careful consideration, and giving the user flexibility is well worthwhile.

This is particularly true of threads which might benefit from being run only on the E cores. While that could be ideal in many use cases, consider giving the user an option in the app’s preferences, as the user can’t promote threads assigned the background QoS so that they can also run on P cores; a sketch of such an option follows below. Designing to get the best out of QoS and the M1’s different cores could make a big difference to the user experience.
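One way of offering that option is sketched here, assuming a hypothetical preference key named runInBackground in the app’s defaults.

```swift
import Foundation

// Return the queue on which the app should run its longer tasks, according
// to the user's preference: .background confines them to the E cores, while
// .userInitiated leaves them eligible for the P cores as well.
func queueForLongTasks() -> DispatchQueue {
    let economical = UserDefaults.standard.bool(forKey: "runInBackground")
    return DispatchQueue(label: "co.example.longTasks",
                         qos: economical ? .background : .userInitiated,
                         attributes: .concurrent)
}
```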

Relevance to users

Most of the time, it’s great that the many background services in macOS run exclusively on the E cores. However, when you do want to accelerate them, perhaps to complete a large Time Machine backup, there aren’t any tools which let those services use your M1 Mac’s P cores to get the job done any quicker.

If you have an app or tool which currently uses substantial amounts of P core time for work which you’d prefer run in the background on the E cores, there are two effective choices:

  • St. Clair Software’s App Tamer has an experimental option which will run such processes on the E cores when the app is in the background.
  • The command taskpolicy -b -p 567, where the last number is the PID of the process, will achieve the same demotion until reversed using taskpolicy -B -p 567.

Some apps do now give the user a choice as to how to run longer tasks. Check the documentation, and ask the developer when they intend to introduce this feature if it isn’t already available.

I’m particularly grateful to Tony, and others who commented on my first draft.