hoakley January 3, 2022 Macs, Technology

Power, frequency, management: how M1 E cores win

Yesterday, I gave some preliminary results from further testing of the performance of Efficiency (E or Icestorm) cores in original M1 and new M1 Pro chips, in which I showed how the M1 Pro’s two E cores can match the performance of the four of an original M1. This article provides more detailed evidence to support that claim.

It relies on an earlier series of five articles, which provide AsmAttic, the testing app that I use, and explain my rationale and previous results. I won’t repeat them here, but refer you to:
How can you compare the performance of M1 chips? 2 Core allocation
Comparing performance of M1 chips: 3 P and E
Comparing performance of M1 chips: 4 Icestorm
Do M1 Pro and Max CPUs run slower on battery?
Anomalies in base performance of M1 cores
On M1 chips 1 + 1 = 4

Methods

The major addition to my methods, based on AsmAttic, is the concurrent use of the powermetrics command tool to obtain detailed information about the cores during testing. Typically, I set the floating-point test up so that each run takes around 10 seconds, start powermetrics logging with the command
sudo powermetrics -i 1000 -o filename.txt -n 10 --show-all
and immediately start the test processes in AsmAttic. This generates a series of nine sampled periods of just over a second each (about 1025 ms normally) saved to the text file for further analysis.
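A minimal sketch of how one sample from that text file might be summarised afterwards, in Python. It assumes that each sample contains per-core lines of the form `CPU 0 frequency: 1000 MHz` and `CPU 0 active residency: 99.50% (...)`. That layout is an assumption, not something specified in this article, and it may differ between macOS versions, so the patterns will probably need adjusting. The name `summarise_sample` is purely illustrative.

```python
import re

# ASSUMED line formats, not taken from the article: adjust these patterns
# to match whatever your version of powermetrics actually writes.
FREQ_RE = re.compile(r"^CPU (\d+) frequency:\s*([\d.]+) MHz", re.M)
RESIDENCY_RE = re.compile(r"^CPU (\d+) active residency:\s*([\d.]+)%", re.M)


def summarise_sample(text):
    """Collect per-core frequency and active residency from one sample,
    and add them up to give the totals discussed in the Results."""
    freqs = {int(core): float(mhz) for core, mhz in FREQ_RE.findall(text)}
    resid = {int(core): float(pct) for core, pct in RESIDENCY_RE.findall(text)}
    return {
        "per_core_mhz": freqs,
        "per_core_residency": resid,
        "total_mhz": sum(freqs.values()),          # e.g. 2 cores x 1900 MHz
        "total_residency": sum(resid.values()),    # 200% = two cores fully active
    }
```

Totals here are simple sums across the cores found in the text, so to get figures for a single cluster you would pass in only the lines for that cluster’s cores.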

powermetrics is highly capable, but comes at the cost that it does alter the system that it’s measuring, imposing a small but significant load on the same cores that you’re measuring. It shouldn’t therefore be used to measure test runs for benchmarking purposes.

Results

In this article, I look at the total power used by core clusters, active residency (as opposed to idle, equating to Activity Monitor’s % CPU) for each core and in total, the frequency of each core and as a total, and a little at instructions retired for each cluster, a measure of throughput. I start with running one process on the E cluster.

The two charts here cover a period after the start of the test process, and before its completion, during which total active residency (shown with + points on the solid line, above) on the two E cores of an M1 Pro remained above 100%. This occurred because background tasks for iCloud were running during testing. Power to the E cluster remained between 40 and 60 mW for much of the period sampled.

Frequencies of the two E cores varied between 1000 and 1200 MHz as a result of the load being slightly higher than had been intended. When the same single process was run on the four E cores of an original M1 chip, results were very similar, with a tendency to slightly lower frequencies because of the lighter total load, without the iCloud tasks.

With four test processes running on the original M1, all four cores remained at active residencies of close to 100% throughout the sampling periods, and the power drawn by the cluster ranged between 160 and 170 mW.

This test wasn’t affected by other background processes, and core frequencies remained close to 1000 MHz throughout, giving a total cluster frequency of just under 4000 MHz.

In those tests, E cores remained at steady frequencies of about 1000 MHz, coping with the additional processes by increasing active residency until it reached 100%. Loading the two E cores in an M1 Pro chip was managed quite differently.

This time, the tests were complete by about half-way through the sampling series. The first set of measurements, at time 0, includes some time before the test processes were running, and the last under load, at 4 seconds, includes some time after the processes had completed, leaving three sampling periods at full load.

In those periods, active residency on each core was 100%, and power consumption peaked at just under 200 mW.

In addition, core frequencies reached 1900 MHz, with a total of 3800 MHz for the pair. Once the test processes were complete, frequencies fell quickly to just over 1000 MHz, for a power consumption of under 10 mW.

Instructions retired, a measure of those completed, reached a peak of nearly 1.8 x 10^9 over the 1 second sampling period, at which time about 0.4 instructions were being executed per clock. The assembly language source of this test contains 4 floating-point instructions (one of which is a fused multiply-add), one integer instruction, and two branches. Counting each of those to make a total of 7 instructions per loop, that peak thus equates to about 2.6 x 10^8 loops per second, which is close to the measured value of just under 2 x 10^8 loops executed per second (see the chart in this article).
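That arithmetic can be checked with a short Python sketch. It uses the rounded values quoted above (1.8 x 10^9 instructions, a total cluster frequency of 3800 MHz, 7 instructions per loop), and assumes that the instructions retired figure is for the whole E cluster and that the sample lasted exactly 1 second. The function names are illustrative.

```python
# Rounded figures quoted in the text above.
INSTRUCTIONS_RETIRED = 1.8e9   # peak for the E cluster in one sample
TOTAL_CLUSTER_MHZ = 3800       # 2 cores x 1900 MHz
INSTRUCTIONS_PER_LOOP = 7      # 4 floating-point + 1 integer + 2 branches


def instructions_per_clock(instructions, total_mhz, seconds=1.0):
    """Instructions retired divided by the total core clock cycles available."""
    return instructions / (total_mhz * 1e6 * seconds)


def loops_per_second(instructions, per_loop, seconds=1.0):
    """Loop iterations implied by the retired instruction count."""
    return instructions / per_loop / seconds


ipc = instructions_per_clock(INSTRUCTIONS_RETIRED, TOTAL_CLUSTER_MHZ)
loops = loops_per_second(INSTRUCTIONS_RETIRED, INSTRUCTIONS_PER_LOOP)
print(f"instructions per clock: {ipc:.2f}")
print(f"loops per second: {loops:.2e}")
```

With these rounded inputs the instructions per clock come out between 0.4 and 0.5, and the loop rate at about 2.6 x 10^8 per second. As the measured peak was a little under 1.8 x 10^9 and each sample lasted slightly longer than 1 second, the true values will be a little lower.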

Retirements fell from that peak of just below 1.8 x 10^9 to less than 1 x 10^8 once the test processes had been completed.

My final charts show a different situation in which the two E cores are put under heavy load: when high-QoS processes, which are preferentially run on P cores whenever possible, overrun the P cores available, here when there are 10 processes. As the E cores are here standing in for P cores, you’d expect them to be run as fast as possible.

Peak E core performance is even shorter here, and is covered by just 2 sample periods during which active residency was at 100% and power consumption reached almost 210 mW.

For three sampling periods, both E cores were run at their maximum frequency of 2064 MHz. That compares with both clusters of P cores, which at the time were running at their maximum frequency of 3228 MHz, and consuming about 4 W per cluster, almost 20 times the power consumption of the E cluster (only 2 cores).

Summary of findings and conclusions drawn

macOS 12.1 manages the four E cores in the original M1 chip, and the two in the new M1 Pro, differently.
In response to a high load of low QoS processes, frequency of E cores in the original M1 chip remains about 1000 MHz.
In response to a high load of low QoS processes, frequency of E cores in the M1 Pro is doubled to about 2000 MHz.
This management policy ensures that demanding background processes with low QoS can complete in similar times on the original M1 and M1 Pro chips. Had M1 Pro chips followed the same policy as the original M1, those processes could have taken twice as long.
In response to a spillover load of high QoS processes from the P cores, frequency of E cores in the M1 Pro is increased to the maximum of 2064 MHz.
When running at 100% active residency and maximum frequency, the cluster of 2 E cores in the M1 Pro consumes 210 mW of power.
When running at 100% active residency and maximum frequency, each cluster of 4 P cores in the M1 Pro consumes 4 W of power.
Although there’s insufficient evidence here to conclude that all cores within a cluster are run at exactly the same frequency, their frequencies don’t appear to differ by much. Within a cluster, active residencies normally remain broadly similar but aren’t as closely correlated as frequencies.
In tight loops of predominantly floating-point code, accessing only registers as used here, performance measured by timing correlates closely with that measured by powermetrics as instructions retired.
The two E cores in an M1 Pro, when at 100% active residency and maximum frequency can outperform a single P core at 100% active residency and maximum frequency, while using one fifth of the power.
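The power comparisons in these findings can be reproduced with a little arithmetic, sketched here in Python. It uses the rounded figures quoted above and assumes, purely for illustration, that a single fully loaded P core draws one quarter of the power of its fully loaded cluster of four. powermetrics reports power by cluster, so that per-core share is only an estimate, and the function names are illustrative.

```python
# Rounded figures quoted in the findings above.
E_CLUSTER_W = 0.210        # 2 E cores at 100% active residency, 2064 MHz
P_CLUSTER_W = 4.0          # 4 P cores at 100% active residency, 3228 MHz
P_CORES_PER_CLUSTER = 4


def cluster_power_ratio(p_cluster_w, e_cluster_w):
    """How many times more power a P cluster draws than the E cluster."""
    return p_cluster_w / e_cluster_w


def e_cluster_vs_single_p_core(e_cluster_w, p_cluster_w, p_cores):
    """E cluster power as a fraction of one P core's power.

    ASSUMPTION: cluster power is shared evenly between its cores.
    """
    return e_cluster_w / (p_cluster_w / p_cores)


ratio = cluster_power_ratio(P_CLUSTER_W, E_CLUSTER_W)
fraction = e_cluster_vs_single_p_core(E_CLUSTER_W, P_CLUSTER_W, P_CORES_PER_CLUSTER)
print(f"P cluster draws {ratio:.0f} times the power of the E cluster")
print(f"E cluster draws {fraction:.2f} of the power of one P core")
```

On those assumptions, a P cluster draws about 19 times the power of the E cluster, and the whole E cluster about one fifth of the power of a single P core.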

13 Comments

1

marcan on January 3, 2022 at 11:53 am

I posted this on HN, but since you might not see it there:

> Although there’s insufficient evidence here to conclude that all cores within a cluster are run at exactly the same frequency, their frequencies don’t appear to differ by much.

They do run at the same frequency. There is one hardware register to control the frequency for the whole cluster. Any differences you see are measurement errors in powermetrics; for example, for the P-cores, it has a habit of reporting the requested frequency for a core, which may not be achievable if it is a boost frequency (only available with some cores in deep sleep) or there is throttling involved, and then those metrics are incorrect.

Use the cluster frequency metric and ignore the core frequencies; the latter are not useful for anything, it’s synthetic data.

> The two E cores in an M1 Pro, when at 100% active residency and maximum frequency can outperform a single P core at 100% active residency and maximum frequency, while using one fifth of the power.

This varies by workload. I did some basic benchmarking to come up with numbers for this, since the Linux scheduler needs them to efficiently schedule processes. The ballpark is that one E-core has about 70% of the performance of a P-core per MHz. Given that E-cores go up to 2064MHz and P-cores up to 3204MHz, that means that a maxed out E-core can perform at around 45% of a maxed out, full boost P-core (which only works if the other 3 cores are in deep sleep); without boost it’s more like 48%. However, for some workloads this is way off – Dhrystone actually gives numbers for E-cores that are a mere 32% of the P-core numbers per MHz, making a maxed out P-core 5 times faster than a maxed out E-core, suggesting that that benchmark gets a huge boost out of the wider dispatch in the P-cores. It seems to be an outlier, though, since I haven’t found anything else with that disparity.
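Those ratios follow from simple scaling, sketched here in Python using only the figures quoted in this comment. The model, in which performance is proportional to per-MHz performance multiplied by clock frequency, is a simplifying assumption, and the function name is illustrative. The non-boost case is left out because no non-boost P-core frequency is given above.

```python
E_MAX_MHZ = 2064         # maximum E-core frequency
P_BOOST_MHZ = 3204       # maximum (full boost) P-core frequency


def e_relative_to_p(per_mhz_ratio, e_mhz=E_MAX_MHZ, p_mhz=P_BOOST_MHZ):
    """Performance of a maxed-out E-core as a fraction of a maxed-out P-core.

    per_mhz_ratio is E-core performance per MHz divided by P-core
    performance per MHz for the workload in question.
    """
    return per_mhz_ratio * e_mhz / p_mhz


typical = e_relative_to_p(0.70)      # the ballpark per-MHz figure
dhrystone = e_relative_to_p(0.32)    # the Dhrystone outlier

print(f"typical: E-core is {typical:.0%} of a P-core")
print(f"Dhrystone: P-core is {1 / dhrystone:.1f} times faster")
```

The first result is about 45%, and the second puts a maxed-out P-core at close to 5 times a maxed-out E-core on Dhrystone.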

This is an obsolete branch that I’ll have to rewrite, but here’s the device tree description for the CPU clusters/cores and frequency/P-state settings for the M1 that I wrote for Asahi Linux:

https://github.com/AsahiLinux/linux/commit/6b4a8c07239a42093…

Another fun thing in there, which is replicating macOS behavior, is that when the P-core cluster frequency exceeds 2GHz it makes a change to memory controller power management settings, to increase the time-outs before it goes into lower power modes. This reduces DRAM latency under bursty/sparse load/store workloads, which slightly increases performance for some workloads. You can actually further disable DRAM PM and get even better latencies measurable in synthetic benchmarks (designed to measure this), though I wonder if it’ll make any difference in any real world workload at that point; I’ll have to test it one day.

- 2
  
  hoakley on January 3, 2022 at 12:13 pm
  
  Thank you.
  I do wish that powermetrics was open source, but maybe Apple thinks that would be giving the game away!
  Yes, I am aware of the evidence that all the cores in a cluster are run at the same frequency, which is why I couched my statement in those terms. I guessed that some would look at the slight disparities in frequencies and come to the false conclusion that they were evidence to the contrary. It’d be interesting to know how powermetrics tries to synthesise those, and why.
  I also appreciate your point about workload. With the exception of the last test in my suite (‘mixed’), all my tests are tight loops which only access registers, so don’t represent anything in the real world, and that’s not their purpose. What is important here is how QoS – which is the only way that macOS developers normally have of influencing scheduling in this way – is used by macOS to manage the cores, and how that management is clearly quite different for the types of M1 chip now available. Several had already asked me whether having only 2 E cores would make the M1 Pro/Max slower at running background tasks like backing up, and I think this answers that question for macOS.
  I’ll be very interested to see how Asahi Linux schedules the different cores, and what access it gives to the developer.
  Howard.
  
  - 3
    
    marcan42 on January 3, 2022 at 2:08 pm
    
    I think powermetrics just logs what the OS intended per-core frequency is, but then that gets collapsed down for the hardware (max freq of intended for all cores is requested, then the hardware caps it depending on restrictions due to core activity or throttling). It’s quite likely this dates back to Intel stuff with per-core clocks, and it just became inaccurate now that they all get collapsed down to 1 per cluster.
    
    macOS definitely has some interesting behavior with QoS, e.g. I believe that even if you set your level to interactive/performance, although you *mostly* get scheduled on the P-cores, sometimes you end up on an E-core anyway, briefly.
    
    Linux lets you play with cores however you want; I don’t know that there’s a concept analogous to QoS, but you can set any process/thread to run on any subset of cores. By default the scheduler uses utilization and relative core capacities to try to schedule things in an efficient manner on big.LITTLE machines like this one. I’m not sure if there is a way to “cap” cluster capacity like macOS seems to, to prevent “background” threads from pushing the E-cores above 1000MHz on M1s; Linux will probably happily bump the cpufreq up to 2GHz as soon as a single process pinned to those cores has 100% utilization. You can manually cap frequency for a cluster though, I just don’t think it’s automated based on running processes like it seems to be on macOS. Since this will by some metric be the first “serious” big.LITTLE platform running proper desktop/laptop Linux, there is room for research here figuring out how to pin system services to clusters and cores to improve efficiency. I wouldn’t be surprised if we end up with some Linux scheduler improvements along the way. E.g. maybe some cgroup settings to cap CPU utilization in a cpufreq-aware manner, if they don’t already exist (there’s definitely CPU usage caps but I don’t know if they resolve to lower frequencies or do something less smart).
    
    - 4
      
      hoakley on January 3, 2022 at 5:59 pm
      
      Thank you, and thank you for cross-posting here.
      I have looked at high QoS and the P and E clusters too. Recruitment occurs by cluster, so with 1-4 processes on an M1 Pro the first P cluster is bumped up to maximum frequency with a total active residency of 100-400% for the whole cluster. A fifth process then recruits the second P cluster, which adds similarly up to all 8 P cores at 100% per core and maximum frequency. Adding the ninth and tenth process then recruits the E core cluster. Although I haven’t looked at frequencies for 9 processes, at 10 both E cores are also at full blast. So it’s quite a systematic managed strategy, when viewed using these in-core loads. Of course in conventional benchmarks, you’ll see all sorts of other effects from reordering, memory access, and so on, which is why I’ve kept to my simple testing.
      Maybe I should also try modelling some of the macOS strategies myself in Linux?
      Howard.
      
- 5
  
  hoakley on January 3, 2022 at 12:15 pm
  
  While I’m replying to you, Hector, following your previous help and going back to the drawing board, I have a three-part series on the Secure Boot process starting tomorrow. If you can spare the time, I’d be very grateful if you’d correct me where I’ve made any errors.
  Thank you for your help and involvement, and all the info you make available.
  Howard.
  
  - 6
    
    marcan42 on January 3, 2022 at 1:54 pm
    
    I’ll be happy to take a look. You might want to reference our documentation; I can’t claim it is absolutely accurate to the letter (and any discrepancies with the PSG should be checked), but I think it will provide useful context:
    
    https://github.com/AsahiLinux/docs/wiki/M1-vs.-PC-Boot
    https://github.com/AsahiLinux/docs/wiki/Glossary
    https://github.com/AsahiLinux/docs/wiki/SW%3ABoot
    https://github.com/AsahiLinux/docs/wiki/SW%3AStorage
    
    Note that I call LLB “iBoot1”; there’s been a bit of contention over what the right name is. Apple likes to call it LLB in the PSG, but my understanding is that is iPhone terminology and on Macs it is more properly called iBoot1 (the PSG applies to both, so maybe they just kept the LLB name there). The actual terminology used in macOS is inconsistent, e.g. the BuildManifest talks about “iBootStage1” while the actual files are called LLB.*. In any case that’s just a terminology nitpick.
    
    - 7
      
      hoakley on January 3, 2022 at 5:48 pm
      
      Thank you.
      I have already cited and acknowledged your documentation, and stolen from it liberally!
      I’ve mentioned the synonyms, but stick to the terms Boot ROM – LLB – iBoot, as I think for the general reader they’re the clearest. I also draw attention to the Sealed/Signed terminology used in the SSV – the APFS reference manual doesn’t mention signing, only sealing, and when the seal is broken we don’t mention that the signature is also broken, which could also be confusing.
      Howard.
      
8

Edwin T on January 5, 2022 at 5:39 pm

Thanks Howard, for this very informative series of articles. Given the E-core behaviour you’ve reported, I am left to wonder about the decision to equip the original M1 with a cluster of four E-cores to begin with.

Now I understand that power consumption often scales faster-than-linear with respect to operating frequency (to simplify a complicated subject quite a bit), and your measurements taken at face value appear to bear this out. I’ve generally understood this to be the reason for having lower-clocked cores (but more of them) in power-constrained designs and especially in the efficiency cores of big.LITTLE configurations.

My confusion is with the actual measured magnitudes of E-core cluster power draw here. Sure, the additional ~50mW power draw in the M1 Pro relative to original M1 is an increase of ~30%. But a comparably large reduction in battery life can manifest *only* when that 160-210mW is the dominant source of system power consumption. For context, in an M1 MacBook Air with a 50Wh battery, that is like saying switching from a 4-core to a higher-clocked 2-core cluster reduced your battery life by 56 hours (from 294hr to 238hr). This is an unrealistic regime that nobody really cares about.

In a more realistic context, suppose we are looking at a total system power draw of ~2.8W. This yields a ~18 hour battery life quoted by Apple for the same M1 MacBook Air. Now if we assume the E-core cluster runs full-tilt throughout that 18 hours (to overestimate the effect on battery life here), then going from 2.8W to 2.85W (again by switching to a higher-clocked 2-core efficiency cluster) reduces battery life by a mere 19 minutes. In cases where the E-core cluster *isn’t* running full-tilt all the time, or where system power is increased e.g. by P-cores being active, the delta is likely to be even smaller than that.
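Both battery life estimates can be reproduced in Python, assuming a constant power draw for the whole discharge and the figures used above: a 50 Wh battery, and either the E-core cluster alone or a whole system drawing 2.8 W. The 294 hour figure corresponds to a cluster draw of 170 mW. The function name is illustrative.

```python
BATTERY_WH = 50.0   # battery capacity used in the comment above


def battery_hours(capacity_wh, draw_w):
    """Hours of battery life at a constant power draw."""
    return capacity_wh / draw_w


# Unrealistic case: the E-core cluster is the only thing drawing power.
cluster_only_4_cores = battery_hours(BATTERY_WH, 0.170)
cluster_only_2_cores = battery_hours(BATTERY_WH, 0.210)

# More realistic case: about 2.8 W for the whole system, plus 50 mW.
system_before = battery_hours(BATTERY_WH, 2.80)
system_after = battery_hours(BATTERY_WH, 2.85)
lost_minutes = (system_before - system_after) * 60

print(f"cluster only: {cluster_only_4_cores:.0f} h vs {cluster_only_2_cores:.0f} h")
print(f"whole system: {system_before:.1f} h, losing {lost_minutes:.0f} minutes")
```

The cluster-only case gives about 294 and 238 hours, a difference of 56 hours, while the whole-system case loses about 19 minutes from just under 18 hours.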

So I’m left wondering why waste precious silicon real-estate in the original M1, if a 2-core cluster of higher-clocked Icestorm cores incurs a mere 50mW additional power draw with few real-world downsides.

-Edwin-

- 9
  
  hoakley on January 5, 2022 at 10:29 pm
  
  Thank you.
  It’s very important to remember that the E cores in all variants of M1 chip have the same maximum frequency, of just over 2000 MHz, and that this strategy isn’t baked into the chip, but in its management by macOS. So at any time in the future, even dynamically perhaps, Apple can change that strategy.
  I don’t think that the extra E cores in the original M1 are a waste of silicon at all. Recall that chip only has 4 P cores, and is primarily aimed at a market with lower performance expectations which, at least in the notebook models, wants to eke out battery power as long as possible, and avoid the need for active cooling. The four E cores there aren’t just to support macOS background services, but as supplements to the P cores, and are recruited whenever high QoS processes have pushed active residency of each of those four P cores to 100%. They then provide a valuable low-energy way of supplementing those P cores without demanding too much from the battery.
  The power estimates given here by powermetrics are only approximate, and are those for the cores themselves, not the whole chip. But an original M1 fully loaded with processes will consume around 4.2 W, whereas an M1 Pro will use almost double that, albeit with nearly twice the instruction throughput in return. I think that’s Apple’s design intention.
  Howard.
  
10

Mark on March 16, 2022 at 12:26 pm

Using asitop on my Mac mini M1 2020 (16GB, 8C GPU) the E-core’s frequency goes up to 2GHz. Didn’t you write that in the M1 the E-core max frequency is 1GHz?

BTW: very interesting articles – thanks for your effort!

- 11
  
  hoakley on March 16, 2022 at 1:48 pm
  
  Thank you.
  No, as I wrote above, maximum frequency of E cores is 2064 MHz. If you want good measurements of core frequencies, then powermetrics takes some beating.
  Howard.
  
12

Ken on March 20, 2022 at 10:20 pm

Hi, what are you using for the charts?

- 13
  
  hoakley on March 20, 2022 at 10:21 pm
  
  Datagraph from the App Store.
  Howard
  
