So far my investigations of the performance of the Efficiency (E) and Performance (P) cores in M1 chips have been confined to running multiple threads in a single app. In the real world, processors are more usually running multiple processes which contend for resources including CPU cores. This article looks at how contention works out depending on the Quality of Service (QoS) assigned to different threads.
Model and methods
Two different apps are used here to compete for CPU cores: my free Cormorant streams files to compress (and decompress) them using multithreaded lossless compression in Apple Archive; my AsmAttic test utility runs tight CPU-bound loops of assembly code, as I’ve explained before. Both apps run their threads in Grand Central Dispatch queues with Quality of Service values set by the user for each test. AsmAttic also sets the number of threads and the number of loops in each thread.
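As a sketch of that dispatch pattern (the names here are illustrative, not AsmAttic's actual code, and the loop body stands in for its assembly routines), a harness can run a chosen number of CPU-bound threads at a user-set QoS on a Grand Central Dispatch queue like this:

```swift
import Dispatch
import Foundation

// Illustrative sketch only: runs `threads` concurrent CPU-bound loops
// at the given QoS on a concurrent GCD queue, and returns elapsed time.
@discardableResult
func runLoops(threads: Int, loops: Int, qos: DispatchQoS) -> Double {
    let queue = DispatchQueue(label: "test.loops", qos: qos,
                              attributes: .concurrent)
    let group = DispatchGroup()
    let start = Date()
    for _ in 0..<threads {
        queue.async(group: group) {
            var x = 1.0
            for _ in 0..<loops {          // tight floating point loop
                x = x * 1.000001 + 0.000001
            }
            _ = x                         // keep the result live
        }
    }
    group.wait()                          // block until all threads finish
    return Date().timeIntervalSince(start)
}
```

Called with `qos: .background`, macOS confines these threads to the E cluster; at `.userInteractive` they can be spread across all available cores.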
Tests were run on two M1 Macs in Monterey 12.2. One, referred to as M1 mini, is an M1 Mac mini 2020 with 16 GB memory, an internal 500 GB SSD and the original M1 chip with one cluster of 4 E cores and one of 4 P cores; the other, referred to as M1 Pro, is an M1 MacBook Pro 16-inch 2021 with 32 GB memory, an internal 2 TB SSD and the M1 Pro chip with one cluster of 2 E cores and two clusters of 4 P cores each. On paper, either of the test apps would be expected to perform better on the M1 Pro than on the M1 mini.
Additional tools used to examine performance include the powermetrics command tool and Activity Monitor’s CPU History window.
Uncontended performance on E and P cores
Before looking at the effects of contention on performance, I first looked at the two tests I intended to use in contention, running at the highest and lowest QoS levels.
Time to compress a 10 GB test file at highest QoS (33) was shorter on the M1 Pro as expected. The task completed in 5.6 seconds on the M1 Pro, and 8.2 seconds on the M1 mini. At this QoS, each test resulted in all available cores being recruited at their maximum frequency, and 100% active residency, according to powermetrics.
When compression was performed at minimum QoS (9), the M1 mini consistently completed the test in a shorter time than the M1 Pro. While the M1 Pro took 55.1 seconds, the M1 mini accomplished the same task in only 37.3 seconds. I have previously reported that, when running tests on the M1 Pro’s two-core E cluster, cores are run at higher frequency than when running the same test on the M1 mini’s four-core E cluster, and that was seen here too. When the M1 Pro was running the compression test solely on its two E cores, their frequency was 2064 MHz, while the four E cores in the M1 mini ran at a frequency of only 972 MHz. Despite that difference in frequency, the M1 mini required just 68% of the time taken by the M1 Pro.
To confirm that this wasn’t the result of Cormorant being built for an older version of macOS (Big Sur), I built and notarized a new version using Xcode 13.2.1. Using that new version, times observed were unchanged, and the M1 mini remained significantly quicker at compressing on its E cores alone.
Results from the floating point test were consistent with my previous observations that the two E cores in the M1 Pro are run at higher frequency to compensate for their smaller number, resulting in better performance on the M1 Pro regardless of QoS. At the highest QoS, 10 threads completed on the M1 Pro in 3.6 seconds, while 8 threads took 4.0 seconds on the M1 mini. At the lowest QoS, 2 threads completed on the M1 Pro in 5.1 seconds, but took 10.3 seconds on the M1 mini, essentially the same time as the mini took to complete 4 threads. High QoS resulted in E and P cores being run at their maximum frequencies, but at low QoS the two chips differed: the M1 Pro ran its two E cores at 2064 MHz, while the M1 mini ran its four E cores at 972 MHz.
Contention on E cores
When tested with contending processes and threads confined to the E cores by the lowest QoS, there were no surprises. Because the two tests take such different times, each run started with the compression task; the floating point test was then added to it, and completed before compression ended.
Adding only two floating point threads, the M1 mini completed compression in 42.4 seconds, with the floating point test taking 11.9 seconds within that; the M1 Pro completed compression in 63.2 seconds, and the floating point test took only 8.8 seconds. While the E cores of the M1 Pro were run at a frequency of 2064 MHz, even with both tests running concurrently the four E cores in the M1 mini remained at only 972 MHz.
Total elapsed time, within which both compression and floating point tests were completed, was shorter for the M1 mini with four floating point threads (47.9 s) than the M1 Pro running only the compression task (55.1 s).
Contention at intermediate QoS
Apple defines four QoS levels, numerically 9, 17, 25 and 33, of which only one (9) results in threads being constrained to one type of core. Threads at each of the three higher QoS can be run on either E or P cores, depending on allocation by macOS. When looking at the effects of different QoS it’s easy to conclude from uncontended testing that there’s little difference between those three levels. To get a better insight, I ran floating point tests at various QoS against compression at a QoS value fixed at 25, using just the M1 Pro.
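Those numeric values are the raw qos_class_t constants behind Dispatch’s named QoS classes, which a quick Swift check can confirm (assuming the usual libdispatch constants):

```swift
import Dispatch

// Map Dispatch's named QoS classes to the numeric values used in the text.
let classes: [(String, DispatchQoS.QoSClass)] = [
    ("background",      .background),     // 9 — the only level confined to E cores
    ("utility",         .utility),        // 17
    ("userInitiated",   .userInitiated),  // 25
    ("userInteractive", .userInteractive) // 33
]
for (name, qos) in classes {
    // rawValue is a qos_class_t; its own rawValue is the UInt32 constant
    print(name, qos.rawValue.rawValue)
}
```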
The table above gives, in the first column, the time in seconds for the compression task to complete. The second column gives the time in seconds for the concurrent floating-point test to complete, with its QoS given in the final column.
This shows the interaction between threads at different QoS levels. When the competing floating-point task has a lower QoS than the compression task, compression is only slowed slightly, and the time required for the floating-point task is more than doubled. When the floating-point task QoS exceeds that of compression, the former takes little longer than it does when run alone, and the compression task takes nearly twice as long.
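That interaction can be sketched by racing the same CPU-bound work on two queues at different QoS, a much-simplified stand-in for the two contending apps (timings will of course vary with hardware and load):

```swift
import Dispatch
import Foundation

// Hypothetical sketch: time identical work dispatched at two different
// QoS levels, mimicking the contention tests described above.
func timedWork(qos: DispatchQoS, group: DispatchGroup,
               completion: @escaping (Double) -> Void) {
    let queue = DispatchQueue(label: "contend.\(qos.qosClass)", qos: qos)
    let start = Date()
    queue.async(group: group) {
        var x = 1.0
        for _ in 0..<5_000_000 { x = x * 1.000001 + 0.000001 }
        _ = x
        completion(Date().timeIntervalSince(start))
    }
}

let group = DispatchGroup()
timedWork(qos: .userInitiated, group: group) { print("QoS 25:", $0, "s") }
timedWork(qos: .utility,       group: group) { print("QoS 17:", $0, "s") }
group.wait()   // under contention, the higher-QoS task should finish sooner
```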
While there are no surprises here, this demonstrates that allocating queues and threads an appropriate QoS is important even when using the three higher levels, which don’t constrain threads to E cores.
Conclusions
- Although processes and threads run on both E and P cores complete more quickly on the M1 Pro, when constrained to the E cores some are significantly quicker on the M1 mini. This occurs despite the difference in frequencies of the E cores when running threads at the lowest QoS.
- M1 mini and Pro chips run their E cores at different frequencies when running threads at the lowest QoS. The four cores in the M1 mini cluster are then constrained to 972 MHz, while the two cores in the M1 Pro cluster may be run at their maximum frequency of 2064 MHz. Code dependent on resources outside the cluster may still run more slowly on the M1 Pro despite that difference in frequency.
- Contending threads from different processes are run concurrently on the core types to which macOS allocates them. Those at the lowest QoS are never run on P cores, even when the E cores are already fully loaded but the P cores are idle.
- When run at any of the three higher QoS levels, macOS allocates priority to threads according to their QoS, so that those with higher QoS are given higher priority than those with lower QoS. Assigning an appropriate QoS to threads is therefore important in determining overall performance, particularly when threads are in contention with others. As a result, assessing performance without contention can be misleading.
- Understanding core allocation and the interaction of QoS levels under contention is essential to achieving optimal app performance on Apple Silicon Macs.