hoakley November 4, 2021 Macs, Technology

M1 Pro First Impressions: 2 Core management and CPU performance

If you’ve read the excellent performance analyses already published by AnandTech and others, by now you’re probably thinking that the CPU cores are among the less innovative parts of the M1 Pro and Max SoCs. I hope in this article to show you how they’re managed quite differently from those in the original M1, and how that might affect what you can do with these latest models.

On paper, the major difference between the M1 and M1 Pro/Max CPUs is core count: the original M1 has a total of eight, half of which are E (Efficiency, Icestorm) and half P (Performance, Firestorm) cores. The M1 Pro and Max have two more cores in total, but redistribute them by type to give eight P cores and only two E cores. It would be easy to conclude that experience with the first design showed that the E cores were only lightly loaded, so fewer were needed, but delivering better performance to the user merited twice the number of P cores. While that may well be true, you also have to look at how the cores are actually used.

As I’ve already described, in the M1, tasks such as background services with a low QoS are run exclusively on its E cores, while those at any of the three higher levels of QoS are scheduled to use both P and E cores. As with symmetric multiprocessing CPUs, load on different cores is otherwise fairly well-balanced.
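The QoS rule described above can be sketched as a simple lookup. This is only a toy model of the observed behaviour, not Apple's scheduler: the four level names are borrowed from Apple's QoS classes as an assumption, and `eligible_core_types` is a hypothetical helper introduced purely for illustration.

```python
from enum import Enum


class QoS(Enum):
    # Four QoS levels, lowest to highest (names assumed from Apple's QoS classes)
    BACKGROUND = 0
    UTILITY = 1
    USER_INITIATED = 2
    USER_INTERACTIVE = 3


def eligible_core_types(qos: QoS) -> set[str]:
    """Core types a task at this QoS was observed to use on the M1 (hypothetical helper)."""
    if qos is QoS.BACKGROUND:
        # Lowest QoS: observed to run exclusively on the E cores
        return {"E"}
    # The three higher levels: scheduled across both P and E cores
    return {"E", "P"}


for level in QoS:
    print(level.name, sorted(eligible_core_types(level)))
```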

To illustrate this balance, here’s a heavy load placed on the eight cores of an Intel Xeon W CPU in my iMac Pro:

All eight real cores on the left are fairly evenly loaded; the eight cores on the right are virtual, and realised by hyperthreading, which wasn’t required here.

Here are the four E and four P cores under load from a high QoS task in the M1 SoC in my M1 Mac mini.

Again, load is normally spread fairly evenly.

Here’s the equivalent, a hefty compression task using the AppleArchive library, running on an M1 Pro. As in the above charts, the red bars represent the System load, and those in green are from the User. I’ll refer to cores by their number and type, so that core 2E is the second E core, and 5P is core 5, a P core.

Load on the two E cores is evenly balanced, for both User and System, but there are consistent differences in load on the P cores. The first four (3P to 6P) are more heavily loaded throughout, although the System (red) load is more even across all eight P cores. Even within the first four P cores there are differences in User load: 3P bears a heavier load than 6P, for instance.

This isn’t always the case, though.

When running the Geekbench 5 CPU benchmarks, early tests are still confined to the first four P cores, but their later tests are more evenly distributed across all ten cores. Surprisingly, these benchmarks seldom exceed 50% load on any of the ten cores, which raises the question of how accurately they represent maximum CPU performance.

Here’s a more obvious example, again using AppleArchive on hefty tasks.

These load the E cores to 100%, with a high proportion of that being System. On the P cores, while the System load is fairly evenly spread, the User load is highest in the first group of four P cores, and lowest in the second group. Even within those two groups, the lowest-numbered core (3P and 7P respectively) bears the heaviest User load, and the highest-numbered core (6P and 10P) bears the lowest.

My app AsmAttic gives precise control over the distribution of numeric benchmark tasks on ARM processors. I therefore turned to that to look in more detail at these unusual patterns of core use.

This image shows a series of benchmarks being run using two different QoS levels. The two E cores were loaded with one slow task first, then two slow tasks, which brought them to full load. Over the same period, a succession of shorter tasks at high QoS levels was run on the P cores. These were only loaded on the first group of four P cores, and during this whole period the second group of P cores was almost completely unloaded.

For this series of tests, AsmAttic had been configured to use a maximum of four concurrent processes. When that is changed to eight, which could have been loaded onto all eight P cores, its tests remain constrained to the first group of four P cores.

Performance differs little on the P cores of the M1 Pro and the original M1. For example, the same hand-coded assembly language for floating point dot product calculation using the ARM Neon vector unit took 0.126 seconds on the M1 Pro (mains power) and 0.142 seconds on the M1. The M1 Pro time is just under 90% of that of the M1.

Differences in performance were much greater on the E cores, where they also varied according to whether the MBP was running on battery alone:

M1 0.409 s (100%)
M1 Pro on battery 0.340 s (83%)
M1 Pro on mains 0.169 s (41%)

Those results are for a comparable benchmark using Apple’s Accelerate library dot product function.
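The percentages quoted for both benchmarks follow directly from the timings. Here is a short Python sketch of that arithmetic; the function name `relative_time` is purely illustrative, and the only data used are the timings given above.

```python
def relative_time(seconds: float, baseline: float) -> int:
    """Return a run time as a whole-number percentage of the baseline time."""
    return round(100 * seconds / baseline)


# E core timings (Accelerate dot product), with the original M1 as baseline
M1_E_CORES = 0.409
e_core_times = {
    "M1": 0.409,
    "M1 Pro on battery": 0.340,
    "M1 Pro on mains": 0.169,
}
for label, seconds in e_core_times.items():
    print(f"{label}: {seconds:.3f} s ({relative_time(seconds, M1_E_CORES)}%)")

# P core timings (Neon dot product): M1 Pro on mains against the original M1
print(f"P cores: {relative_time(0.126, 0.142)}%")
```

A lower percentage means a shorter run time, so the M1 Pro's E cores on mains power complete this test in well under half the time taken by those of the M1.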

Taken together, these results show that process allocation to cores in the M1 Pro and Max is carefully managed according to QoS (as in the M1) and between the two groups of P cores. This management aims to keep the second group of P cores unloaded as much as possible, and within each group of P cores loads lower-numbered cores more than higher-numbered ones. This is very different from the even balancing seen in symmetric cores, and in the M1.

The end result is that the two E cores in the M1 Pro/Max are significantly faster (in some respects, at least) than the four E cores in the M1, although the E (but not the P) cores are slowed when running on battery alone.

Because of this sophisticated asymmetric core management, measuring CPU performance in the M1 Pro/Max is more complex than when cores are managed symmetrically. While running on battery alone shouldn’t impair the performance of CPU-bound tasks run at higher QoS, you should expect background services run on the E cores alone to take longer.

There are also interesting implications for developers wishing to optimise performance on multiple cores. With the advent of eight P cores in the M1 Pro/Max, it’s tempting to increase the maximum number of processes which can be used outside of an app’s main process. While this may still lead to improved performance on Intel Macs with more than four cores, the core management of these new chips may limit processes to the first block of four cores. Careful testing is required, both under low overall CPU load and when other processes are already loading that first block. Interpreting the results may be tricky.

I suspect that Apple has done this to further improve energy efficiency and ensure good responsiveness to new CPU-intensive tasks.

I eagerly look forward to seeing more detailed information explaining how the E cores in the M1 Pro/Max appear to outperform those in the M1.

30 Comments


1

Nick C on November 4, 2021 at 8:26 am

It does rather look from the above as if the new chips are not being fully unleashed most of the time.

Is there any way to force all cores to be used at maximal performance, or do you think that they are maybe being actively constrained to be “as fast as necessary” for use in the laptops? Or is there possibly some limitation in the OS architecture at present which limits how fast they CAN run? That last point seems highly unlikely to me as apart from anything else it would negate the whole point of the unified memory and super-fast fabric.

Does this maybe mean that we might see the same chips performing even better (maybe by as much as twice) when they are employed in desktops like the iMac (maybe) and new MacPro, where power and heat limitations are essentially not a factor?

- 2
  
  hoakley on November 4, 2021 at 9:58 am
  
  Thank you.
  I’m not sure I’d draw those conclusions, as you’d need to look carefully at entirely CPU-bound processes to see. While my AsmAttic tests are pure CPU, they mix Neon with regular floating point, and almost no integer load, so would need more careful design to fully load P cores, I suspect.
  There is, though, a good reason why no CPU should allow all its cores to be fully loaded except in extremis. They also have to support a GUI and a user who could want to interact while they’re working on a high priority task. So I would think that keeping something in reserve is normally quite a good plan for the P cores, which have to service the user most.
  That said, I’m surprised the Geekbench tests didn’t achieve a higher loading. They’re supposed to be CPU benchmarks, but here seem constrained in their load, and I don’t think that’s core management that’s responsible, as the loads shown are seldom as high as 50%, even though they seem to load each of them.
  I should add that I couldn’t hear any fans during any of these tests, so I don’t think thermal load came into play. I also performed most of them when the MBP was connected to the mains, except where indicated.
  More exploration and different tests are required.
  Howard.
  
3

CraigM on November 4, 2021 at 11:18 am

I think some of this behaviour may be attributable to hardware factors of the design (not just QoS, software job allocation, etc), as the physical design seems to cluster 4 cores into a module which has its own distinct cache/memory interface and then multiply that design to reach capacity.

That would tally somewhat with the power management results, where it’s the cluster which is managed, not the individual core, etc.

- 4
  
  hoakley on November 4, 2021 at 12:29 pm
  
  Thank you. Yes, it does match what has been observed and proposed elsewhere.
  For SMP, there seems little advantage to managing cores any more than balancing their load, and activating hyperthreading etc.
  In the M1 SoC, the only decision-making seems to be which type of core (P or E) to use, and within those load seems fairly evenly balanced.
  In the M1 Pro (and presumably Max), the top-level choice is P or E again, which can be set by the programmer using QoS. Then, if the P cores are to be used, there’s the question of whether to use one cluster/group or both. Within those, there also seems to be a strategy of loading the lowest-numbered core most, and the highest least, although I don’t know how consistent that is.
  Whether these strategies are determined within the SoC or macOS I have no idea, and look forward to someone who knows what they’re talking about explaining what is going on here.
  I suspect that some is designed into the SoC, and some determined at a kernel level, which would give maximum flexibility. Some of this could be communicated to the SoC in the form of the pseudo-NOPs which are passed in instructions.
  Howard.
  
  - 5
    
    name99 on November 6, 2021 at 6:22 pm
    
    The scheduling (and control of clusters) is absolutely done on a per-cluster basis.
    The details are (of course!) complex but described in a set of patents.
    
    The primary patent is here: https://patents.google.com/patent/US10884811B2
    but don’t read that first!
    
    First read
    https://patents.google.com/patent/US20170357302A1/en
    which describes how various parts of the hardware collect information at various granularities to see how energy is being used. (This varies from cycle-by-cycle, as in the digital power estimator, which must ensure that current draw in any cycle never exceeds what the battery can provide, to thermals, which vary no faster than over seconds or so.)
    
    Then read
    https://patents.google.com/patent/US11080095B2/en
    which talks about the technology used (and usable) by the class of apps that knows their performance requirements (eg video decode or games must produce a frame at a certain rate). These deadlines, and the rate at which they are being approached, are communicated to the OS, and the OS can tell if it’s appropriate to speed up or slow down the system.
    
    Finally the master patent builds on these ideas. Essential concepts include
    – scheduling is by thread group, not thread
    – threads are aggregated into groups with similar requirements. This can be because they want similar performance, because one depends on another, because the developer grouped them together, because the OS grouped them together (UI thread with a compute thread), etc. The groupings are dynamic and vary as the environment changes and as code behavior/statistics change.
    – the machinery is described in terms of either deadline threads (easy to understand, just boost or reduce performance every so often to stay on track for the deadline) or E -background threads (optimize for minimum energy use, which may or may not be “run CPU as slow as possible” depending on how much activity is happening elsewhere, how much the thread needs to go off-cluster and off-chip [to DRAM], etc)
    
    – less obvious is what happens with “normal” code. I think that’s not discussed because it’s not new, but the idea seems to be that once the OS has constructed optimal thread groups by the above criteria, it then adds in any “normal” threads, as optimally as possible, to any thread groups (E or P) that have slots free.
    Once this is done, the thread group then runs at the optimal behavior for the “most aggressive” thread in the thread group.
    Remember that many things are locked together — a cluster runs at a single frequency, the SLC or DRAM run at a single frequency — so if you have high perf code, it drags upward everything that gets scheduled with it.
    
    The impression I get from the patent is that scheduling is considered very much a “split” task.
    There is a world where we have high perf code, and in that world everything runs as fast as possible (given thermals, current draw, etc).
    And there is the other world (which for most users is their machines 99% of the time) when the user is reading the screen, typing, watching a movie, or whatever and now the imperative is “minimize energy while hitting deadlines”. You only enter this world when there is no high QoS thread active, but that’s actually most of the time!
    
    Clearly the tricks we play to lock code to an E-core kinda give sub-optimal results by this logic — we’re telling the OS, via the QoS, to optimize for energy,
    but in other circumstances if the *same code* ran on an E-core as a *high perf* spill-over from all P-cores active, the OS would make very different DVFS decisions for the E-cluster (even apart from the SLC and DRAM running a lot faster).
    
    That seems to be what you are seeing in your E-core results.
    For battery power, the system maintains that same split of “high perf” vs “optimize for energy”,
    whereas on mains maybe there is no longer any attempt to optimize for energy — either we run on deadline, or we run as “race to sleep”, and get the background stuff done ASAP then power down everything?
    That seems like closer to what most users would want on mains power — prioritize getting things done, even background things like indexing and backup, over obsessively optimizing for energy.
    
    - 6
      
      hoakley on November 7, 2021 at 11:24 pm
      
      Thank you.
      Yes, I’ve read those patents, as referred to in your superb paper. What I wonder is whether Apple’s implementation in the M1 Pro and Max has changed with the experience gained with the original M1 chip, which does appear to be a fairly direct implementation of what appears in those patent descriptions.
      I need to do more runs comparing mains and battery power, and with Low Power Mode enabled to see the interactions between those and observed performance and core use. I also need to use code which places less emphasis on floating-point: the ‘idiomatic’ Swift is perhaps a more typical example of mixed code, as a lot of its inefficiencies are down to the overhead in manipulating sequences in order to run the floating-point calculations.
      Howard.
      
7

Duncan on November 4, 2021 at 2:39 pm

I predict that soon we’ll see a range of cores from Efficiency to Low(er)-Performance to High-Performance, and perhaps even more intermediate distinctions as this technology develops. As prioritization routines get more sophisticated I imagine finer and finer optimization between the various tasks and the hardware in place to handle them. No need to fire up a high-performance core if an intermediate core is available that just matches the task.

- 8
  
  hoakley on November 4, 2021 at 6:47 pm
  
  Thank you. Although I wouldn’t be surprised, there are already three different cores in the M1 series SoCs, so we’re getting there. The third type is a cut-down version used throughout the Fabric and other parts of the chip, including one known as the ‘always-on processor’, although of course it shuts down when the Mac is shut down.
  Howard.
  
- 9
  
  name99 on November 6, 2021 at 6:34 pm
  
  There are limits to how far to push this differentiation, and I suspect Apple is at the sweet spot. The 3-way split on the recent high-end ARM chips strikes me as more marketing than sense…
  
  The basic issues are
  – the OS (and even the developer) doesn’t have a fine granularity of what they want. The developer only really knows “this should be fast” or “this can be slow”.
  
  – where finer granularity can help (deadline based code, as described in my large comment above) the OS can hit that granularity with DVFS. Even for high perf code the OS can practically make use of DVFS while also practically honoring the user and developer’s desires — start the code running at the lowest DVFS but ramp up very fast. If the high perf code runs for a long time, you’ll be at max perf within a few ms anyway; if the high perf code (respond to a keystroke or whatever) only runs for an ms, who cares if it ran at a DVFS state of half the max frequency.
  
  A better use of speciation like this is more specialized accelerators (and accelerator-like code) which may live in places you do not expect; for example, if you can off-load most network interaction to specialized network logic rather than a core, then the operation of handling a constant stream of packets (eg AirPods audio, or watching a streaming movie) can mostly run on the low-energy specialized logic, with the energy cost of a core only required on the rare occasions where something weird happens (error, start and stop the stream, change the bit rate, that sort of thing).
  
10

Craig Doran on November 4, 2021 at 4:47 pm

It would be interesting if the performance core activation pattern observed on the 8-core M1 Pro model showed consistency of the 2 E cores under consistent load but then there were 3 heavily used P cores and 3 lazy P cores. I would think less expected would be 4 hyperactive and 2 lazy.

- 11
  
  hoakley on November 4, 2021 at 6:51 pm
  
  Thank you.
  One note of caution here is that my understanding of the 8-core version is that it’s at heart a 10-core in which two of the P cores didn’t meet the spec for ten. So it may vary according to which those two cores are. I would suspect that most run with three in each of the clusters/groups rather than four, but that could vary.
  Does anyone here have one of the 8-core M1 Pros to test, perhaps?
  Howard.
  
  - 12
    
    Gramatan on November 5, 2021 at 2:33 pm
    
    I have one on order for early December. I could run some tests for you when it arrives.
    
    - 13
      
      hoakley on November 5, 2021 at 11:11 pm
      
      Thank you. I’d be most interested when you do get it if you could watch CPU History in Activity Monitor – although it’s useful to run tests, basic observation is the most important thing.
      Not long to go now!
      Howard.
      
14

crgmrgn on November 4, 2021 at 7:32 pm

I’d also caution against extrapolating too much from the M1 system’s behaviour, as we have no evidence yet that an M1 efficiency core directly equates to the M1 Pro/Max efficiency core, etc. I’d be inclined to believe that, in the intervening development time that the M1 Pro/Max benefitted from, it’s quite likely that the overall efficiency of a core has been advanced … for instance, if they’ve realised any power savings in the implementation of their Perf cores (not being run to their max), then that might explain their willingness to reduce efficiency core count. ie. a lightly loaded M1 Pro/Max perf core might be significantly more efficient than its equivalent on M1 and approach M1 efficiency core levels of operational power consumption.

The flexibility that this would buy would be immense, whilst the true efficiency cores would provide the benefit of running up just one core rather than the cluster and hence still be optimum for truly lightweight tasks, like mail checks during sleep, etc. What we really need is a perf/power matrix for the cores of each gen, so we can look at how the graph slopes overlap each other … tough admittedly without Apple’s insight.

- 15
  
  hoakley on November 4, 2021 at 11:08 pm
  
  Thank you.
  There certainly seems to have been significantly improved performance in the M1 Pro/Max E cores, where 2 E cores outperform the 4 E cores of the M1.
  Howard.
  
16

Fazal Majid on November 5, 2021 at 12:41 am

I hope they allowed for priority inversion and also that low-QoS system services may be blocking for higher-priority interactive processes, e.g. Mail.app waiting for Spotlight to complete indexing. Ideally there would be a way for a high-QoS process blocking on a system service to bump it up to a P-core.

- 17
  
  hoakley on November 5, 2021 at 2:12 pm
  
  Thank you. I wasn’t aware that Mail waited for indexing to complete. That seems a strange behaviour, given that on any Mac, such indexing is a background activity which could take a long time to complete.
  Howard.
  
  - 18
    
    name99 on November 6, 2021 at 6:42 pm
    
    Certainly this is acknowledged and handled by scheduling both at the low CPU level
    – if a P core is waiting on a lock held by an E-core (or similar sorts of patterns, like a P core constantly reads data from the E-cluster cache) then the E-cluster is boosted in performance
    – if a high perf thread is linked via some OS construction (pipe, semaphore, …) to a background thread, the background thread will be boosted into the thread group of the high perf thread.
    
    The problem may arise if there is an “implicit” dependence between two processes, something like Mail.app waiting for Spotlight to create a particular file, where the scheduling part of the OS has no knowledge of this linkage.
    The solution, of course, is to have Mail.app inform the OS of the dependency, and there are a variety of ways to do this.
    
    I expect that over the next few years these sorts of linkages will be sorted out, in the same way that when Apple prioritized system energy usage, it took a few years from providing a set of primitives in the OS to having every Apple app and framework (and even more so for third party developers) using those tools in an optimal fashion.
    
    - 19
      
      hoakley on November 7, 2021 at 11:32 pm
      
      Thank you.
      I’m not convinced that Mail does have to wait for indexing to complete in that way. mds_stores has a burst of initial activity preparing the Spotlight indexes soon after startup, during which searches are likely to be refused or delayed. Although some later actions produce spikes in activity, for example following a Time Machine backup, I don’t think those affect the availability of other Spotlight indexes, for example those covering mail, and it would be very poor design if they did. So, when you receive new mail, there’s normally a brief burst of indexing activity by mds_worker processes, which completes very quickly. I don’t see why Mail should become unable to run searches throughout this.
      Howard.
      
20

Boyd Waters on November 5, 2021 at 7:30 am

macOS has a “Low Power Mode” setting in the “Battery” Preference Pane.

I wonder how that affects the load distribution.

- 21
  
  hoakley on November 5, 2021 at 2:12 pm
  
  Thank you.
  I haven’t looked at it. I suspect the distribution is the same, but that peak loads are lowered.
  Howard.
  
22

Apple M1 Pro / Max: Always core micromanagement for Liquid macOS on November 5, 2021 at 10:30 am

[…] tasks. As a result, developer Howard Oakley shows on his blog that this organisation is different and that the cores […]

23

Fernando Pastrana on November 5, 2021 at 11:36 am

Great review. It would be interesting to have an app that simulates, based on common software loads, how the M1, M1 Pro and M1 Max will perform, so everyday users can identify which one is the best fit for them.

- 24
  
  hoakley on November 5, 2021 at 2:17 pm
  
  Thank you. I think the answers are already there in terms of what specifications are most suitable for different types of use.
  Howard.
  
25

Adam Bridge on November 6, 2021 at 4:46 pm

I’m not sure I understand everything I know about the term QoS as you use it here. I think understanding the term more fully would help me understand your observations. Have you written about your definition of QoS?

Thank you.

- 26
  
  hoakley on November 6, 2021 at 5:02 pm
  
  Thank you – I’m sorry, yes, I have covered this before.
  QoS stands of course for Quality of Service, but in the context of macOS has a very specific meaning. It’s one of the properties which the coder can set when running Processes. There are only four levels: the three highest run the code on all the cores that the system makes available, which normally means they run predominantly on the P cores. The lowest is intended for background services, and is almost exclusively used by macOS for Spotlight indexing and maintenance, and Time Machine backups, etc. The lowest is currently scheduled exclusively on the E cores.
  Setting the QoS is thus the only real tool that a developer has to determine the priority allocated to the Process, and which cores it will be run on. And it’s one that I use both for test purposes in AsmAttic, and for user control, in my compression utility Cormorant. A few other developers, like Mike Bombich in CCC, also give the user control over the QoS, but most don’t, and run background processes at higher QoS, so they don’t realise the benefits of the E cores.
  Does that explain a bit more? You’ll also have joy by searching this blog for the term QoS – plenty more to read!
  Howard.
  
  - 27
    
    Adam Bridge on November 6, 2021 at 7:26 pm
    
    Thank you very much for your thoughtful reply. I will most definitely engage the Mighty Site Search to read your articles on QoS. I’ve only recently discovered your site (through a search for articles about JMW Turner!) and continue to enjoy your Mac-related materials. I’ll be following your guidelines when my MacBook Pro arrives this week.
    
28

Walt French on January 26, 2022 at 7:53 pm

Hi, I’m now looking at my Activity Monitor on my MBP Max Pro reporting about 25% “CPU Load” at the same time the CPUHistory shows its 2 efficiency cores—1E & 2E—running near-max, 3P+4P together totaling about half power; the other Ps at near-zero

Suggesting the 25% stat treats all 10 cores as equivalent. Which obviously, they’re not

Do you have a good SWAG at the relative throughput of an individual E core (or per-core for the pair), compared to an individual P core (or again, its contribution to a cluster)?

Thanks muchly for your input here!

- 29
  
  hoakley on January 26, 2022 at 10:36 pm
  
  Thank you.
  If you’d like to look at some of the more recent articles I’ve written looking in more detail at E and P core performance, you’ll find a great deal of information. These are generally listed in the M1 page, from the top menu under the banner.
  What Activity Monitor doesn’t tell you is the frequency those cores are running at. If an E core is in efficiency mode and ticking over at 1000 MHz, but has high active residency, then Activity Monitor will show it as close to 100%. Of course, as its maximum frequency is over 2000 MHz, that should really be less than 50%! On the other hand, when the P cores are loaded, they tend to run at over 3000 MHz, and don’t mess around much at lower frequencies. So what you’re seeing in Activity Monitor can be very misleading.
  Generally, an E core has about half the internal units (e.g. floating-point, ALU) of a P core, and can run at a maximum frequency of two-thirds that of a P core. However, their throughput of instructions can come close (relative to frequency) to that of a P core.
  In normal use, the two E cores of an M1 Pro/Max will run most code as fast as all four E cores in the original M1 chip, which is achieved by frequency differences. And the throughput of the whole E cluster (2 or 4 cores) will be higher than a single P core, but less than two.
  Howard.
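
As a rough illustration of the point above about clock frequency, a core's displayed load can be scaled by the ratio of its current frequency to its maximum. This Python sketch is a deliberate simplification: `frequency_adjusted_load` is a hypothetical helper, and 2000 MHz is used only as an assumed lower bound for the E core's maximum, since the reply says only that it is "over 2000 MHz".

```python
def frequency_adjusted_load(active_residency: float,
                            freq_mhz: float,
                            max_freq_mhz: float) -> float:
    """Scale active residency (0 to 1) by how far the core is clocked below its maximum."""
    return active_residency * freq_mhz / max_freq_mhz


# An E core with full active residency, ticking over at 1000 MHz.
# 2000 MHz is an assumed lower bound on its maximum frequency, so the
# true figure would be somewhat below the value printed here.
print(frequency_adjusted_load(1.0, 1000, 2000))
```

Under those assumptions the core that Activity Monitor shows as close to 100% is delivering no more than about half of what it could at full frequency.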
  
30

Michael Tsai - Blog - M1 Icestorm Performance and Asymmetric M1 Pro Core Management on March 1, 2022 at 3:50 pm

[…] Howard Oakley: […]
