hoakley December 7, 2021 Macs, Technology

Comparing performance of M1 chips: 4 Icestorm

So far in this series looking at comparing the performance of the cores in M1 series chips, my attention has largely been devoted to processes running at high Quality of Service (QoS), resulting in them being scheduled first and foremost on the Firestorm performance (P) cores. Only when they are fully loaded does macOS schedule processes with high QoS on the Icestorm efficiency (E) cores. This article looks at the opposite end: processes which are given the lowest QoS, to be run as background services. Those include most of the everyday services in macOS, together with any designated by third-party software.

It was last May that I reported that using the lowest setting for QoS results in processes running exclusively on the E cores in an M1 Mac. Since then we’ve gained M1 Pro/Max chips which behave qualitatively the same. Using the same tests that I ran at high QoS in my last article in this series, I can now provide a quantitative comparison between both the core types in the original M1 and new M1 Pro chips. For full details of the tests and methods used, see that last article.

The only difference in the methods used to examine the E cores is that the maximum number of processes is limited to 5 on the original M1, with its 4 E cores, and 3 on the M1 Pro, with its 2 E cores. Numbers of test loops were also adjusted to cope with the different performance of the E cores.

All tests run at minimum QoS were confined to the E cores as expected. In this sequence on my M1 Mac mini, the number of processes increases from 1 at the left to 4, 6 and 8 at the right. As noted before, rather than cores being recruited individually, for the whole series the load is shared fairly evenly across all the cores in the E cluster, and there’s no use of the P cores at all.

Here’s a similar sequence of 1-3 processes on the M1 Pro. Although the CPU % for the single process (left) was given as 100%, it was shared evenly over the two E cores, and the eight P cores remained near idle throughout. There’s one tantalising detail which you might not otherwise notice: the width (duration) of the second test with 2 processes is obviously smaller than that of the first test with a single process. Similarly, the peak of the third test with 3 processes is narrower than that of the first test.

Results

I now have measurements of loop performance for the same five tests running exclusively on 1-8 P and 1-4 E cores on the two chips. To compare these here, I show graphs of the loop performance, expressed as billion (10^9) loops/second, against the number of processes, which is the same as the number of cores (although in some cases the total is shared across a greater number of physical cores than processes).

Taking the Integer test first, this graph combines the measurements made on the P cores, shown with a solid linear regression line, with those on the E cores, with the broken regression line. Data points from the original M1 are shown using +, and those from the M1 Pro as x.
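For readers who want to see how a regression line of this kind is obtained, here is a minimal sketch of an ordinary least-squares fit. It is an illustration only: the function name `fit_line` is made up for this example, and the data points are synthetic, generated from the Integer P-core line given in the Appendix rather than taken from the measurements.

```python
# Ordinary least-squares fit of throughput (billion loops/s) against process count.
# The data points below are synthetic: they are generated from the Integer P-core
# line in the Appendix purely to show the fit, and are NOT the measured values.

def fit_line(xs, ys):
    """Return (intercept, gradient) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return intercept, gradient

processes = list(range(1, 9))  # 1-8 processes on the P cores
throughput = [0.014345 + 0.24511 * n for n in processes]  # synthetic, see above

intercept, gradient = fit_line(processes, throughput)
print(f"throughput = {intercept:.6f} + {gradient:.5f} x processes")
```

The gradient is the figure of interest in what follows: it is the extra throughput gained for each added core of that type.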

Performance for the P cores is identical between the M1 and M1 Pro, and reflects the significantly higher performance of those cores. Integer performance of the E cores of the M1 is considerably lower, as expected given its ‘half P’ design and lower clock speed. There is, though, one obvious outlier, and that’s the performance achieved by two processes running on the M1 Pro, which is about double that of the M1, although it’s still significantly lower than a single process on a P core.

Results for the Floating-point test are similar in pattern, again with points which are tight on the regression lines. This time though the outlier of 2 processes on the M1 Pro E cores is more than twice that on the M1, and higher than a single process on the P cores.

NEON performance is very similar to floating-point, with the M1 Pro outlier.

Accelerate also behaves similarly, but this time the M1 Pro’s performance with 2 E cores is almost the same as on a single P core.

I remarked in my last article that my final ‘mixed’ test didn’t behave as well as the others when used on the P cores, and that’s reflected in the scatter of data points, particularly with 5 processes and more. Despite that, it shows a similar pattern, with the M1 Pro two-process outlier closer to that seen in the Integer test.

Equations for each of the regression lines shown above are given in the Appendix below.

E core performance

There’s a clear difference in behaviour between the E cores on the original M1 and M1 Pro chips. With four E cores available, macOS runs them at a performance substantially lower than that of the P cores when those are given high QoS processes. With only two E cores, when they’re loaded with two processes, macOS increases their performance to a level which exceeds that of all four E cores in the M1, even though the processes are run at the same minimum QoS.

Two major determinants of the relative performance of these cores are their architecture and clock speed. Work by Dougall Johnson, Maynard Handley and others has led to the proposal that an E core is essentially half a P core, so at the same clock speed you’d expect a P core to run these tests in roughly half the time taken by a single E core. Using that equivalence and the gradients of the regression lines in the previous graphs, it’s possible to derive an ‘equivalent clock speed’ for the E cores, relative to that of the P cores under a high QoS load, which should be about 3.2 GHz.

This table gives the gradients for the solid lines (P) and broken lines (E), which are the performance increments achieved by each added core of that type. The performance of both E cores together on the M1 Pro, the outlier point on each graph, is given in the third column. The effective clock speeds for the E cores on the M1 chip are then calculated from the E and P gradients, as is the speed required to achieve the outlier, given as the maximum ‘effective clock speed’ needed to explain that result. Note that the maximum clock speed reported for the E cores is just under 2.1 GHz, and any effective clock speeds above that aren’t likely to be actual clock speeds!

When loaded with processes at low QoS, the four E cores of the original M1 appear to be run at a clock speed of 0.8-1.6 GHz, well below their maximum of 2.1 GHz. When the two E cores of the M1 Pro are loaded with a single process at low QoS, they too run at those reduced clock speeds. But a second process at low QoS on the M1 Pro results in an increased clock speed which boosts performance.
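The calculation behind that range can be sketched as follows. This is a reconstruction under the stated assumptions rather than the exact working used for the table: it takes the gradients from the Appendix, treats one E core as half a P core at the same clock, and assumes the P cores ran at 3.2 GHz. The names `GRADIENTS` and `effective_e_clock` are invented for this example.

```python
# Effective E-core clock speed derived from regression gradients.
# Assumptions (from the article): an E core is 'half a P core' at the same clock,
# and the P cores run at about 3.2 GHz under high QoS load.

P_CLOCK_GHZ = 3.2
P_TO_E_RATIO = 2.0

# (E gradient, P gradient) from the Appendix, in billion loops/s per added core
GRADIENTS = {
    "Integer": (0.029141, 0.24511),
    "Floating-point": (0.033277, 0.15025),
    "NEON": (0.22343, 0.91313),
    "Accelerate": (0.16468, 0.88853),
    "Mixed": (0.00078972, 0.0055178),
}

def effective_e_clock(e_gradient, p_gradient):
    """Clock speed (GHz) at which a 'half P' core would deliver e_gradient."""
    return P_TO_E_RATIO * (e_gradient / p_gradient) * P_CLOCK_GHZ

effective = {name: effective_e_clock(e, p) for name, (e, p) in GRADIENTS.items()}
for name, ghz in effective.items():
    print(f"{name:15s}{ghz:.2f} GHz")
```

With these inputs the Integer test comes out lowest and the NEON test highest, spanning roughly 0.8-1.6 GHz.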

The performance improvement seen on the M1 Pro’s two E cores varies according to the type of load. It’s lowest with code running in the ALU throughout, where it appears close to that expected from a real clock speed of 2.1 GHz. However, in code running predominantly in the Floating-point/NEON units, the effective clock speed is higher than the maximum of the P cores, requiring additional explanation.
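To make that reasoning concrete, here is a rough sketch of how a two-process result on the M1 Pro's E cores maps onto an effective clock speed. The measured outlier values aren't reproduced in the text, so none are used here; instead the code works out what throughput a real 2.1 GHz clock would predict. It assumes the same 'half P' equivalence and 3.2 GHz P-core clock as above, and ignores the small intercepts of the regression lines. The function names are made up for this example.

```python
# Sketch only: relates a two-process E-core throughput to an 'effective clock speed'.
# Assumes the article's 'half P' equivalence and a 3.2 GHz P-core clock, and ignores
# the small intercepts of the regression lines.

P_CLOCK_GHZ = 3.2
E_MAX_CLOCK_GHZ = 2.1
P_TO_E_RATIO = 2.0

# P-core gradients from the Appendix, in billion loops/s per added core
P_GRADIENTS = {
    "Integer": 0.24511,
    "Floating-point": 0.15025,
    "NEON": 0.91313,
    "Accelerate": 0.88853,
    "Mixed": 0.0055178,
}

def effective_clock(two_core_throughput, p_gradient):
    """Effective clock (GHz) implied by the combined throughput of two E cores."""
    per_e_core = two_core_throughput / 2
    return P_TO_E_RATIO * (per_e_core / p_gradient) * P_CLOCK_GHZ

def throughput_at(clock_ghz, p_gradient):
    """Combined throughput of two 'half P' E cores running at clock_ghz."""
    per_e_core = (p_gradient / P_TO_E_RATIO) * (clock_ghz / P_CLOCK_GHZ)
    return 2 * per_e_core

for name, p_gradient in P_GRADIENTS.items():
    at_max = throughput_at(E_MAX_CLOCK_GHZ, p_gradient)
    print(f"{name:15s} two E cores at 2.1 GHz would give about {at_max:.5f}")
```

Under these assumptions, two E cores match one P-core gradient only at an effective 3.2 GHz, so any two-process result that beats a single P core implies an effective clock above anything the E cores are reported to reach.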

It’s possible, although unlikely, that P core results for both the M1 and M1 Pro were attained with all the P cores running at less than their maximum clock speed of 3.2 GHz, despite those processes being given the maximum QoS throughout. If that were the case, then it would make performance testing extremely unreliable.

Conclusions

macOS manages background processes with minimum QoS quite differently on the original M1 and the new M1 Pro chips. On both chips, all such processes are confined to the E cores, even when there are only two E cores and all the P cores are essentially inactive. Background processes are evenly loaded across all the E cores available in the cluster. A single process is then run with an effective clock speed of about half the maximum of the E core, in both the M1 and M1 Pro.

On the four E cores of the original M1, additional processes don’t appear to change the clock speed, so the difference in performance between E and P cores widens progressively.

In contrast, macOS manages the two E cores in the M1 Pro quite differently. When a second process is added, the performance of the cores increases as if their clock speed is increased, probably to their maximum of 2.1 GHz. However, additional performance improvements are seen in code predominantly run in the Floating-point/NEON units, and those don’t appear to be explained on the basis of clock speed alone. The overall effect is that running some processes on both E cores can be faster than on a single P core, with its higher maximum clock speed.

This suggests that macOS currently – in macOS 12.0.1 – adopts different strategies for these two M1 series chips. Even when used in a desktop Mac, the original M1 E cores appear to be managed primarily for energy efficiency. In the face of increasing load from background processes, they run those slowly. Only when processes are given a high QoS are the E cores run at higher clock speeds.

With half the number of E cores, macOS manages the M1 Pro chip more aggressively, increasing the performance of the E cores when they’re loaded with a second process. Looking back at the CPU History for three processes, above, once the load has fallen to that of a single process, the clock speed and performance appear to reduce again.

This emphasises again how assessing the performance of M1 series chips is much more complicated than it is for conventional processors with homogeneous cores.

I eagerly look forward to someone who knows what they’re talking about explaining the performance boosts seen in Floating-point/NEON code.

Appendix: Equations of regression lines shown in graphs

Integer

E 0.0018909 + 0.029141x
P 0.014345 + 0.24511x

Floating-point

E 0.0028444 + 0.033277x
P 0.0083001 + 0.15025x

NEON

E 0.019773 + 0.22343x
P 0.053349 + 0.91313x

Accelerate

E 0.01583 + 0.16468x
P 0.05226 + 0.88853x

Mixed

E 0.00011246 + 0.00078972x
P 0.0031558 + 0.0055178x
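As a rough guide to reading these equations, the sketch below evaluates each line for a given number of processes, with x as the number of processes and the result in billion loops/second. It is an illustration only: `LINES` and `predicted` are names invented here, and the values describe the fitted trends, not the two-process M1 Pro outliers discussed in the article.

```python
# Evaluate the Appendix regression lines: throughput = intercept + gradient * x,
# where x is the number of processes and throughput is in billion loops/second.
# These are fitted trends only; they do not model the M1 Pro two-process outliers.

LINES = {
    "Integer": {"E": (0.0018909, 0.029141), "P": (0.014345, 0.24511)},
    "Floating-point": {"E": (0.0028444, 0.033277), "P": (0.0083001, 0.15025)},
    "NEON": {"E": (0.019773, 0.22343), "P": (0.053349, 0.91313)},
    "Accelerate": {"E": (0.01583, 0.16468), "P": (0.05226, 0.88853)},
    "Mixed": {"E": (0.00011246, 0.00078972), "P": (0.0031558, 0.0055178)},
}

def predicted(test, core_type, processes):
    """Throughput predicted by the fitted line for the given test and core type."""
    intercept, gradient = LINES[test][core_type]
    return intercept + gradient * processes

for test in LINES:
    gaps = [predicted(test, "P", n) - predicted(test, "E", n) for n in range(1, 5)]
    four_e = predicted(test, "E", 4)
    one_p = predicted(test, "P", 1)
    print(f"{test:15s} 4 E: {four_e:.5f}  1 P: {one_p:.5f}  "
          f"P-E gap at 1 and 4 processes: {gaps[0]:.5f}, {gaps[-1]:.5f}")
```

On these fitted lines, four processes on the E cores at minimum QoS deliver less than a single process on a P core in every test, and the gap between the two core types grows with each added process.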

13 Comments


1

Simon on December 7, 2021 at 4:28 pm

This is most beautiful analysis, Howard. Combined with your previous articles on the topic, your investigations are clear and straightforward for us to follow, minimizing the unknowns through your simple but clever experimental design. Kudos to you, Sir. It appears there are still some mysteries to unfold when it comes to M1 vs M1 Pro and how similar or dissimilar they perform despite their suspected identical P/E core designs. One day we will all understand much better how these chips work in our Macs and when we do, significant credit for that will go to you and your analysis.

May I submit just one request: in addition to the gradients, would you mind publishing your offsets/threshold values (loop throughput for zero cores) for your various fits? Thank you.

- 2
  
  hoakley on December 7, 2021 at 6:08 pm
  
  Thank you.
  I’m only dabbling gently, but it’s proving fascinating.
  I’ll add the regression equations shortly.
  Howard.
  
  - 3
    
    Simon on December 7, 2021 at 10:38 pm
    
    Many thanks for the fit coefficients, Howard. Much appreciated.
    
4

Javier Gallardo on December 7, 2021 at 4:55 pm

My knowledge is too limited to exactly understand all the interesting info you’re revealing, but I think I get an idea (thanks to your clear writing, surely).
I beg your pardon if my question is too simple: I conclude that MacOS is in command when delivering tasks and managing cores. Does hardware take part? Of course, it must, in some level, I presume. But, related to this, how much control has a programmer on this?
To make my question simpler, and pointing to my real interest: should we expect a change and refinement from developers to really fulfill the power in this new CPU? Or it’s this mechanism optimization just on Apple’s side? (Or maybe this new way a processor works is interesting but trivial regarding efficiency profit…?).
Thank you.

- 5
  
  hoakley on December 7, 2021 at 6:14 pm
  
  Thank you.
  We don’t know what is in macOS, and what is in the chip itself, but I suspect that pretty well all the controls I show here are in macOS, which means they can be tuned quite readily.
  The developer has a lot of control, so long as they understand how the system works. Their main control is in setting the QoS, something Apple has been telling us is very important. There are also design issues, with how many processes are used, which will affect performance differently on different M1 chips.
  Plenty of developers are already hard at work trying to get best performance out of the M1, and some are getting excellent results. With experience, many will realise the potential of the chips.
  So, although this isn’t simple, I think in the long run we will all benefit.
  Howard.
  
6

name99 on December 8, 2021 at 6:48 pm

Howard I’m still not sure quite what your current test code does, given our last discussion.
But essential questions are
– are the various test codes limited by sequential code, by CPU width, or by memory?
In other words, how many operations do you get per cycle compared to what you expect?

The reason this is important is
(a) What does the L2 cache for the E cluster on a Pro and a Max look like? I haven’t seen anyone report this, but one possibility is that the L2 remains the same size as for an M1 (so 4MiB) but shared over two rather than four cores. It could even be larger (maybe layout just allowed for more transistors to be used because of how the various rectangles lined up?)
If your code is mostly being limited by L2 bandwidth this could play a role in the M1 vs Pro/Max performance

(b) Alternatively if you are streaming very long vectors (longer than L2) then you’ll be limited by SLC and DRAM bandwidth. On M1, for P cores
– max read bandwidth from the L1 is ~100GB/s (~32B/cycle)
– from L2 it’s about 85GB/s (about 6/7 of L1 — and the reason for that 6/7 is VERY interesting…)
– from SLC and DRAM it’s about 65GB/s

I don’t know what the E core situation is, but I would guess that the L1 bandwidth is essentially halved, and couldn’t even venture a guess as to what the L2 and SLC bandwidths look like.
But point is we know that on Max and Pro the SLC can feed the L2 (and thus the core) at essentially the full L2 bandwidth (so about 85GB/s for pure reads, higher if you also throw in some writes).
So again we might have that the E core is doing so much better on Max and Pro because it can be fed data from SLC and DRAM so much faster.

I think you’d be able to understand what’s happening a lot better if you characterized all your test code in this way, as essentially what sort of IPC you would expect vs what you are seeing, whether the difference is a result of memory traffic, and traffic to L1, L2, or SLC/DRAM.

I’m writing up another large chapter on these issues but, as always, my desire for thoroughness and testing things from multiple directions means it’s still a work in progress, with at least a month or two before it’s released 😦

- 7
  
  hoakley on December 8, 2021 at 6:57 pm
  
  Thank you.
  If you refer to the source I’ve provided, the Integer test uses compiled Swift on Int vectors of length 4, so should be run entirely from registers. The Floating-point and NEON code is written in assembly language, and only accesses registers. The Accelerate code I strongly suspect calls similar NEON code, and works with 32-bit floating-point vectors of length 4, so should also only access registers. Only the Mixed test may go beyond registers, as that’s compiled into more tortuous code.
  So of those tests, only the Mixed one should go beyond the registers, which is exactly what I intend, and state above.
  So L2 shouldn’t be involved in any of these, apart possibly from the Mixed test. I doubt whether L1 is involved either.
  Howard.
  
  - 8
    
    name99 on December 8, 2021 at 7:13 pm
    
    But remember, Howard, when I looked at your code last time, I concluded that it did not do what you thought it did (or what I thought it did), and was not identical across use cases.
    For example the Swift code constantly ACCUMULATES the tempA value, the Accelerate code OVERWRITES that value. I don’t know if there are typos in your explanatory document, but that was where I gave up last time because of the inconsistency between the different pieces of code.
    
    I’m looking at the code blocks in the ReadMeFirst.pdf of asmattic4
    
    It does look like you are limited purely by compute, no touching memory, which simplifies the analysis. But it also looks like your code is much more serial than realistic code (at least it may be — as I said, I’m not sure the extent to which the documentation displays typos).
    Once we have figured out exactly what the code IS in each case (ideally the ground truth assembly in each case — it’s easy to see — just put a breakpoint in the code, and flip the Debug>Debug Workflow> Always Show Disassembly setting in Xcode — for the Accelerate case the code may be inlined, or you may have to step into a function)
    we can go further.
    Specifically (after analyzing how the compiler behaved differently in each case, always interesting!) we can compare an expected cycle per loop iteration vs reality and get a feel for how the clock frequency is being dialed up or down by the E core in different conditions.
    
    - 9
      
      hoakley on December 8, 2021 at 7:19 pm
      
      If you’d do me the courtesy of clicking on the link in the above article to
      Comparing performance of M1 chips: 3 P and E
      then scrolling to the end of that, you’ll see the full source of each of the tests used. The code run in the Floating-point and NEON tests is exactly that which I wrote in the source code, as you’d expect. We can haggle about exactly what the Integer and Accelerate code does another time, but as that also works with vectors of length 4, I’m at a loss to see how those could possibly reach into L2 cache.
      Howard.
      
    - 10
      
      name99 on December 8, 2021 at 7:51 pm
      
      Sorry, Howard, this communication channel is really not ideal for either of us!
      
      Great, it looks like the code there is correct (unlike the code in the asmattic PDF I listed).
      Even so showing the resultant assembly would also be nice because the optimization choices used can have important consequences. (I don’t have any experience with this for Swift, but I have seen this have serious consequences for C/C++ microbenchmarks, because Apple’s default options are
      – optimize for size not speed and
      – don’t allow fastmath (ie use of FMA and rearranging computations)
      
      Let me look at the NEON assembly, and think about how it should play out in timing.
      As you said, memory does not seem to be a factor…
      Give me a day or two.
      
    - 11
      
      hoakley on December 8, 2021 at 7:55 pm
      
      Thank you. I’m once again up to my eyes in writing until Monday, so there’s no rush. I greatly appreciate your interest. I’m following you on Twitter so am happy to DM there if that’s easier.
      Howard
      
12

Maynard Handley on December 27, 2021 at 9:25 pm

Dammit! I wrote a long comment here about how beautiful this work is, but Safari in its wisdom has eaten it! I’m too tired to repeat it all but salient points are
– I finally understand the point you were grasping for, trying to explain to me, and how weird this is!

– I agree that a higher frequency E-core is not a reasonable hypothesis.

– My best hypothesis is that what Apple is calling E-cores are actually lower-clocked P cores. This sounds like a crazy hypothesis, but is it really?
The die shots Apple have released show the E core as the same size across all three SoCs, eg

Took me many hours more than I expected but here is my die shot interpretation of Apple's M1, M1 Pro and M1 Max.
I… twitter.com/i/web/status/1…—
Locuza (@Locuza_) October 19, 2021

But we know those die shots are somewhat works of art. Compare with a real M1 die shot (unfortunately no real Pro and Max die shots yet exist)

You have to rotate the one to match the other but when you do you see a number of low-level differences (most immediately obvious in the Media Engine block). So it’s certainly plausible that Apple tweaked the Pro and Max die shots to hide various issues they consider “strategic” including, eg the exact size, and even placement, of the E cores…
Even P cores are so small given the size of Max and Pro, maybe the layout allowed for someone to slide two full-sized Ps + a smaller L2 in the space allocated to E-cores, and everyone thought “why not?, as long as we can clock them slow enough that they don’t kill background battery…”

- 13
  
  hoakley on December 27, 2021 at 10:38 pm
  
  Thank you.
  That certainly seems more plausible than overclocking Icestorms!
  I’m sorry I wasn’t clear before, and my methods are old and kludgy, but this is really just quantification of what you see on the CPU History. I’ve looked at this several times now, and feel those anomalous points are an accurate reflection of what’s happening, and there’s not a glimmer of any activity on the P cores.
  Tonight I’m working through the boot process in the log, and reminding myself how iBoot runs largely on a single core, then part-way through fires the other cores up ready for the macOS boot phase. Again, what you see in the Platform Security Guide isn’t exactly misleading, but it paints quite a different picture from what really happens. There’s no shortage of smoke and mirrors.
  Howard.
  
