hoakley November 17, 2021 Macs, Technology

How can you compare the performance of M1 chips? 1: Geekbench

One of the first things you want to know about any new processor or chip with processor cores is its performance. Is it faster than equivalent processors made by Intel or AMD, and is an M1 Pro faster than the original M1? Over the last year, I’ve been looking at different ways of measuring this for Apple’s M1 chips, and this article and its sequels summarise some of the lessons so far.

My starting point is running widely used benchmarks in Geekbench 5 on the 8-Core Intel Xeon W processor in my iMac Pro. Here’s what I see in Activity Monitor’s CPU History window for a typical test run.

bmsgeekbench01

In each of these CPU History windows, time passes from left (oldest) to right (newest) for each of the panels, with red representing system load and green the app load. In this case, Geekbench ‘single core’ tests were run for the period starting about a third of the way across each panel, then the ‘multi-core’ tests cut in just after half way, and are reflected on all the cores, until they complete and load drops to almost zero. Because this is an Intel CPU, the cores on the left with odd numbers are ‘real’, and those with even numbers on the right are virtual cores provided by Hyper-Threading.

In fact the ‘single core’ tests are distributed across all eight cores, but look as if their total represents something approaching 100% load on a single core, confirmed by the figure given in Activity Monitor’s main window. The ‘multi-core’ tests only attain 100% briefly on all cores, but average well over 50% throughout, and were sufficient to bring the iMac’s fans up to speed. Load distribution is also fairly even and follows a similar pattern on each core shown.

My conclusion is that the resulting benchmark doesn’t fully assess the capacity of all eight cores, but it’s probably not far off.
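To illustrate how a load spread across several cores can still add up to roughly one fully loaded core, here is a minimal Python sketch. It assumes a single always-busy thread that the scheduler moves round-robin between cores at fixed intervals, and a monitor that reports each core's busy fraction per sampling window. The function name, dwell time and window length are all hypothetical, and real macOS scheduling is far less regular than this.

```python
def per_core_utilisation(n_cores, dwell_ms, window_ms, total_ms):
    """Busy fraction of each core in each sampling window.

    Models ONE thread that is always running and hops to the next core
    every dwell_ms milliseconds (an assumed, simplified schedule).
    """
    samples = []
    for window_start in range(0, total_ms, window_ms):
        busy_ms = [0] * n_cores
        for t in range(window_start, min(window_start + window_ms, total_ms)):
            core = (t // dwell_ms) % n_cores  # core running the thread at time t
            busy_ms[core] += 1
        samples.append([b / window_ms for b in busy_ms])
    return samples


samples = per_core_utilisation(n_cores=4, dwell_ms=500, window_ms=1000, total_ms=2000)
for row in samples:
    print(row, "total:", sum(row))
```

In this toy model no core ever shows more than 50% in a window, yet each window's total is 1.0, which is one core's worth of work.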

When Geekbench 5 runs the same CPU tests on my M1 Mac mini, the picture is quite different.

bmsgeekbench02

The single-core tests are run on just two of the Performance (P) cores, where they seldom reach a total of 100% load, but exceed 50% much of the time. While the multi-core tests do load all eight of the cores, they only reach 100% for brief periods at the start and end of the tests, and for much of the time barely reach 50%, although they’re spread evenly, on P and E cores.

Try that on an M1 Pro running on mains power, and the problems are even more apparent.

M1Progeekbench

Single-core tests are distributed across the first cluster of four P cores, and probably amount to a total of significantly less than 100%. The multi-core tests, though, never reach 100% on any of the ten cores, and much of the time fall well short of 50%, although they appear similar in pattern and evenly balanced across the cores, including the E cores.

If we expect a CPU benchmark to reflect maximum capacity of the cores to take load, there’s a wide gulf between the results on the Intel Xeon and Apple’s M1 chips. There are, of course, a host of reasons which could account for this, from inefficient code generation for the ARM cores to inaccuracies in Activity Monitor. Unfortunately, it’s extremely hard to assess why this occurs.

Assuming that the Geekbench performance figures are linear, with twice the performance being reflected as twice the figure (as claimed by Primate Labs), one way to get a better idea is to run multiple copies of the tests to reach the target 100% load. When I first tested my M1 Pro, it returned a result of 1772 for single core, and 12548 multi-core even though none of those tests came close to using 100% of any of its cores. When two copies of Geekbench 5 were run at the same time, started within a couple of seconds of one another, the single core score remained unchanged, and the two multi-core scores were 9828 and 8845, a total of 18,673.

bmsgeekbench03

During the initial single core tests, total load exceeded 100% across all four cores in the first cluster. When the multi-core tests were running, 100% was reached for substantial periods at the start and end of that phase, and in between load was well over 50%.

The final test in this series was to run three copies of Geekbench simultaneously, which returned single core scores of 1682-1717, only slightly lower than for a single run, and multi-core scores of 7162, 7061 and 6428, totalling 20,651.

bmsgeekbench04

The CPU history shows much fuller load on the cores during the multi-core testing, although even then load wasn’t sustained at 100% throughout.
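The totals quoted for two and three simultaneous copies can be checked with a few lines of Python. This is only a sketch of the arithmetic: adding scores together assumes that Geekbench figures are linear, as stated above, and the helper name `aggregate` is made up for illustration.

```python
# Multi-core scores reported above for the M1 Pro (Geekbench 5).
single_run = 12548
two_copies = [9828, 8845]
three_copies = [7162, 7061, 6428]


def aggregate(scores, baseline):
    """Return the summed score and its ratio to a single-copy run."""
    total = sum(scores)
    return total, total / baseline


for label, scores in [("two copies", two_copies), ("three copies", three_copies)]:
    total, ratio = aggregate(scores, single_run)
    print(f"{label}: total {total}, {ratio:.0%} of single run")
```

This prints totals of 18673 and 20651, or 149% and 165% of the single-copy multi-core score.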

This isn’t a claim that the Geekbench score for an M1 Pro should be raised to over 20,000, but it suggests that, if these benchmarks were able to make fuller use of the cores in the M1 Pro, they’d be more likely to deliver a score of over 18,000. That relies on such high loading being possible, which also needs demonstration.

bmsgeekbench05

My last CPU History for today doesn’t rely on Geekbench, but on some test loads which I’ve been developing in my own app AsmAttic. Each of these tests is a mixed benchmark consisting of integer and floating point operations run millions of times in a tight loop. For the first half of this chart, the P cores were loaded with one copy of the task, which was run fairly evenly across the four cores in the first cluster, with the E cores and the second cluster of P cores largely inactive.

Just after half way, when the P cores had completed that initial task, the two E cores were loaded successively with two copies of the same iterative task, so that with both copies running they reached 100% load. Towards the end of that, I loaded the P cores with multiple copies of the same task, bringing the first cluster to 100%. In the final phase, I loaded eight copies of the same task onto the P cores, and managed then to achieve 100% load across all the cores in both clusters. Not only is it possible to attain 100% core loads using these synthetic tasks, but this can also be seen in real-world apps, for instance when using AppleArchive for compression.

What’s also interesting here is that, despite the great variation in loading of the cores, when run on the P cores 10^8 iterations of the test took 14.2 to 18.9 seconds, quite a tight range considering the differences in total core load during execution.
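For readers who want to try something similar, the following Python sketch launches several copies of a tight mixed integer and floating-point loop at once and reports how long each copy took. It is not AsmAttic's code, which isn't shown here: the loop body, the `run_copies` helper and the iteration count are all assumptions made for illustration, and Python gives no control over which cores (P or E) macOS assigns to each copy.

```python
import subprocess
import sys

# Hypothetical stand-in for a synthetic test load: a tight loop mixing integer
# and floating-point arithmetic. It is not AsmAttic's code.
LOOP_SOURCE = """
import sys, time
n = int(sys.argv[1])
start = time.perf_counter()
i_acc, f_acc = 0, 1.0
for k in range(1, n + 1):
    i_acc = (i_acc + k * 3) & 0xFFFFFFFF  # integer work
    f_acc = f_acc * 0.999999 + 0.5        # floating-point work
print(time.perf_counter() - start)
"""


def run_copies(copies, iterations):
    """Launch `copies` identical loops together; return each copy's own loop time in seconds."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", LOOP_SOURCE, str(iterations)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(copies)
    ]
    # Each child times its own loop and prints the result, which is read here.
    return [float(p.communicate()[0]) for p in procs]


if __name__ == "__main__":
    for copies in (1, 2, 4):
        times = run_copies(copies, iterations=1_000_000)
        print(copies, "copies:", [f"{t:.2f}s" for t in times])
```

Watching Activity Monitor's CPU History window while this runs with different numbers of copies shows how the load is spread, and comparing the per-copy times indicates whether they stay within a tight range as the total load changes.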

My next step is to use synthetic loads to compare different M1 chips, and different conditions, including power options, which I’ll describe in the next article.

15 Comments


1

Piotr Tobolski on November 17, 2021 at 11:47 am

I wonder whether Geekbench is not linear, or scores for single- and multi-core are calculated differently. If a single performance core can achieve 1750 points and all cores achieve over 18000, it would mean that this processor should have over 10 performance cores, but it doesn’t. With zero throttling and linear scoring I would expect a multi-core score of about 16000, considering that 2 cores are efficiency cores with about 50% of the performance of the faster cores.
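The estimate in this comment can be written out as a short calculation. This is a sketch under the comment's own assumptions (linear scoring, no throttling, eight P cores, and two E cores each worth half a P core); the function name is hypothetical.

```python
def expected_multicore(single_score, p_cores, e_cores, e_ratio):
    """Linear estimate: each E core counts as e_ratio of a P core."""
    return single_score * (p_cores + e_cores * e_ratio)


estimate = expected_multicore(1750, p_cores=8, e_cores=2, e_ratio=0.5)
print(estimate)      # 15750.0, i.e. about 16000
print(18000 / 1750)  # P-core equivalents needed to reach 18000, just over 10
```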

- 2
  
  hoakley on November 17, 2021 at 12:18 pm
  
  Thank you.
  The Geekbench docs are quite explicit in claiming to be linear, but that must assume that they fully load all available cores.
  Note what I wrote about the single-core test: “Single-core tests are distributed across the first cluster of four P cores, and probably amount to a total of significantly less than 100%.” So, if the 1750 figure is an underestimate, multiplying it by the number of cores only multiplies that error.
  Also note: “This isn’t a claim that the Geekbench score for an M1 Pro should be raised to over 20,000, but it suggests that, if these benchmarks were able to make fuller use of the cores in the M1 Pro, they’d be more likely to deliver a score of over 18,000.”
  What I think the results show is that, while Geekbench has its merits, it isn’t an accurate measure of the pure CPU performance of the M1 series chips. The evidence suggests that it underestimates their performance, both in single- and multi-core tests. It might equally underestimate the performance of other processors too, but comparison here between the Intel Xeon and M1 Pro doesn’t seem equal, and appears to disadvantage the M1 Pro.
  I’m also very wary of making assumptions about the relative performance of the P and E cores, and between the original M1 and M1 Pro. My purpose in measuring is to avoid those assumptions, and be led by the results.
  I admit I’ve also cheated a little, because I know the results already, although I’m running further tests to confirm them!
  Howard.
  
3

William D Schwaderer on November 17, 2021 at 2:56 pm

Thank you for another great post.

My M1 Max arrived two days ago and I have produced a couple of videos. Normally, I have performance monitor running on a second screen but there presently isn’t one with it so I neglected to bring it up.

Looks like this morning is just the right time….

Thanks again.

4

Gregor Brandt on November 17, 2021 at 3:59 pm

Although multi-core Geekbench appears to hit 100% load on the Intel chips, it would be an interesting comparison to see two or three copies of Geekbench run simultaneously on the Intel chip to see if we get an increase in additive scores.

- 5
  
  hoakley on November 17, 2021 at 5:01 pm
  
  Thank you. Yes, that’s a very good point, although not so easy for me to test, as that iMac Pro is my production Mac, so runs all sorts of other overhead. However:
  Intel 8-Core Xeon one app scores 1108, 7832
  Intel 8-Core Xeon two app scores 1090, 1095; 5042, 4520; total multi-core 9562, which is 122% of single-app.
  M1 Pro (figures above) two app multi-core was 149% of the one app score.
  So, as might be expected, the Intel score does rise when two sets of Geekbench BMs are run simultaneously, but the rise for an M1 Pro is considerably greater.
  Do we score for fans too? :)
  Howard.
  
  - 6
    
    Gregor Brandt on November 17, 2021 at 6:04 pm
    
    I expected some rise, but 22% is higher than I expected given the CPU usage seems near 100%. Interesting. Yes we should score on fan speed as well :-)
    
    - 7
      
      hoakley on November 17, 2021 at 8:15 pm
      
      Thank you.
      John Poole, developer of Geekbench, has suggested that some of this may be the result of the one-second breaks between each of the tests, although I think that should be roughly equal in its effect here, and small in the context of the whole series. He’s got (at least) an M1 Max and is getting deeply into it!
      I should perhaps make clear that none of this is intended as criticism of Geekbench, which I continue to use and love, and recommend without reservation. But it’s raw core performance that I’m most interested in, and loading the cores fully is an important step there.
      Howard.
      
8

Javier Gallardo on November 17, 2021 at 7:37 pm

Waiting eagerly for your power options analysis.
M1 (and Pro, Max) performance is quite interesting, but I feel a big part of the innovation in these CPUs (and/or SoCs) is performance per power consumed. I’ve heard there’s no throttling when working on battery, which is a big thing. Wattage consumed depending on CPU load would be interesting, and I believe it to be perhaps the M1’s biggest merit.

- 9
  
  hoakley on November 17, 2021 at 8:17 pm
  
  Thank you.
  Yes, in notebooks in particular I agree. But when on mains power and in desktop models, energy consumption may be in second place for many users. Apple even provides a High Power Mode for the M1 Max.
  Howard.
  
  - 10
    
    Javier Gallardo on November 18, 2021 at 8:11 am
    
    Oh.
    I thought M1 could be used in server farms or industrial/corporate work, 24/365, in the near future. In these cases, power consumed is relevant.
    I was pondering the M1’s possible success.
    
    - 11
      
      hoakley on November 18, 2021 at 10:31 am
      
      Thank you.
      Yes, I’m sure that Apple could develop a version in the M1 series which would be good for servers. There are other manufacturers who are already selling ARM-based servers, which seem to perform well. However, you may remember all Apple’s problems with the Xserve: it’s a specialist market, and without a server OS, I doubt whether they’d have much success.
      It will be interesting to see.
      Howard.
      
12

name99 on November 17, 2021 at 8:44 pm

Howard, I think you are misinterpreting what you are seeing here, because the granularity of Performance Monitor is so much lower than the granularity of timing appropriate to benchmarks.
Performance Monitor does not record some sort of “how many instructions per second is the CPU performing, compared to peak potential” — that info is available, via the Performance Counters, but it’s not what Performance Monitor is showing you.
What Performance Monitor is showing you is what fraction of each second or so a core is either active (as viewed by the OS) or paused waiting for code to be dispatched to it.

GB5 deliberately takes pauses between each individual benchmark (and, for all I know, between repeated runs of a given benchmark). This is deliberate in the sense that GB5 wants to provide something like a “best possible value” for each CPU, so it allows a brief window after each round of exertion for the CPU to cool down, via some sort of sleep() API. (This has become more important on x86 cores with their extreme turbo-ing modes, and could in principle be important for mobile though in practice it seems not to be.)
And of course the benchmark developers are not fools. Timing and “amount of work done” values are clearly paused during these periods when the code sleeps.

This is what you are seeing in PerfMon, and why the code appears to bounce between CPUs: after each sleep, there’s no strong reason for the code to run on the previous CPU. (Again, all the OSes now, on x86, seem to prefer bouncing code between cores on thermal inertia grounds; for M1 Pro/Max Apple seems to prefer to stick to the same core, for L1 affinity, but bounce between the two clusters, for a larger effective L2.)

Many other benchmarks, especially commercial benchmarks, do the same sort of thing. But I can assure you that if you run “understanding” benchmarks rather than commercial benchmarks (by this I mean the sort of benchmarks run by people like myself or AnandTech, where the goal is to understand the CPU, not to prove some point) you get behavior as expected. For example right now I am running on my MBA my benchmark for testing a variety of latencies in the CPU, a benchmark that runs at 100% of a single CPU for many minutes.
In PerfMon the pattern is that each P CPU mostly shows a single bar at 100% for a second or so, then nothing for about 3 seconds, but there are plenty of cases where two CPUs both show a 50% bar. Clearly (for whatever reason) the CPU is moving the code between P CPUs, on a schedule that is not synchronized with the PerfMon display, so that you could imagine that there is a whole lot of performance available to a core not being tapped; but that just isn’t so.

One can argue about what GB5 (or any other benchmark) “should” do, but that’s usually a tribalism argument. A better question is “what is this benchmark designed to show?” GB5 is designed to show the peak possible performance for a chip given “reasonable” cooldown periods between short spurts of work, in other words how many chips (on PCs and phones) are used by most people, and to do that for both single-threaded type code and multi-threaded type code. And it does that pretty well and fairly accurately.

A different type of benchmark would ask “what happens if I max out one CPU [or all CPUs] and run them for a long time without any pausing?” Anandtech’s SPEC benchmarks (Rate_1 and Rate_N) do that.
In a sense your AsmAttic tests are showing the same thing. If you load every core (ie you provide 8+ high-QoS threads ready to run) then you will see 100% loading of the P cores in PerfMon. Again this tells us only what the OS sees — it does not, eg, tell us if those cores are mainly waiting for DRAM, or mainly recovering from mis-predicted branches, or all the other ways that a CPU can perform at substantially less than 8 instructions per cycle.

Yet a third type of benchmark is no longer interested in the user experience, asks very focussed questions like “what is the TLB hierarchy across the SoC”, and runs very carefully constructed (and completely artificial) loops, often with zero relationship to any sort of normal, realistic code, to answer such questions. That’s most of what people like myself and Dougall J have been doing.

In all these cases the PerfMon app is a terrible tool from which to make any sort of judgements. It can tell us interesting things about how the OS is choosing to schedule code (in particular the scheduling algorithms on Pro and Max seem to be rather different from on M1), but those are OS investigations, not CPU investigations.
If you want to play with this sort of thing in AsmAttic, you should use the Performance Counters. The API is easy to use, but has been wrapped up into something even easier by Daniel Lemire and colleagues:
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2021/03/24/m1cycles.cpp
His wrapper code is in C++ (the basic Apple APIs are in C) and you might find it a fun project to create a similar wrapper in Swift.
Obviously the immediate baselines everyone wants are cycle counts and instructions retired, but after that you can start counting things like SIMD instructions, branch mispredictions, and all that jazz.

Early in the M1’s life most of the counters were not available to the public (only about ten or so, with names in the header). The current belief is that all the counters (probably a hundred or more) are accessible to developers, but their explanations are not yet given.
Dougall has made some preliminary guesses, as you see if you look into the details on his M1 instruction timing pages. I have modified some of his guesses, but mostly this side of things (ie the precise meanings of the non-publicly-described counters) remains very much a work in progress!

- 13
  
  hoakley on November 17, 2021 at 9:50 pm
  
  Thank you, Maynard.
  “What Performance Monitor is showing you is what fraction of each second or so a core is either active (as viewed by the OS) or paused waiting for code to be dispatched to it.”
  That’s exactly my point. If a benchmark runs at less than 50% on that core, that means that for most of the time, that core is paused, for whatever of many reasons. That’s fine when you’re benchmarking predefined tasks, such as encryption or compression, which is what Geekbench is designed to do, and does well. As you well know, what you then need to ask yourself is how close performance on those tasks is to performance on the tasks which you’re going to use that chip for.
  That’s not what I’m interested in. I’m interested in looking at, and comparing, performance across the P and E cores in the original and Pro/Max chips when they’re fully active, i.e. close to 100%.
  Some of that, of course, is managed for the cores. One simple observation which I find fascinating is that, when running tight code loops which only access registers, at high QoS, given processes don’t run as fast as they could to completion. These sub-maximal core-intensive tasks usually take the same time to complete regardless of whether the cores are busy, or almost entirely idle. While I’m sure that’s very familiar to you, I (and I suspect most who read this blog) make the assumption that an M1 runs high QoS processes as fast as it can most of the time, certainly when not constrained by power or thermal limits.
  I’m using Activity Monitor’s CPU History window to (a) see which cores the tasks are running on, which I can’t find anywhere else unless you can suggest a better way, and (b) to see whether each core is fully/partially active at that time, which also doesn’t appear readily accessible another way. When I can then change core load under my control and environmental controls such as battery/lowpower/mains, I can get a better idea of differences in the same task performance.
  For example, it’s assumed I think that each E core is across the board around half a P core. In which case, you’d expect NEON code to run at about half the speed on an E core as on a P core, and similarly for floating point and integer. Is that fair? Because that’s not what I see when I run my on-core tests, where the ratios are different.
  Currently I have a brief window between writing engagements, which gives me a little time to look at this again. I’m amused that you think that writing wrappers for C++ code in Swift would be a “fun project” before I resume writing Q&A on Sunday. It’s bad enough getting such wrappers to work for assembly code, and something that I have said I will return to when I have the intestinal fortitude. In the meantime, for these purposes, Mach ticks may not be as high-resolution, but they serve my purpose perfectly well, and my code to access them is well-proven and as accurate as the ticks (which is, of course, not as fine-granular on ARM64 as Intel, but more than adequate when your tests take seconds rather than ns).
  While painstaking instruction timing work is of enormous value and accomplishment, in my recreational projects it doesn’t really answer the same questions. I know this is pretty amateur stuff, but it’s more closely related to answering frequent questions such as why tasks are completed slowly when total CPU load is very low, which is the sort of thing that ordinary users wonder. And that is a striking feature of the M1 series Macs.
  So maybe what you think I’m trying to do is rather different from what I am trying to do, in my clumsy and inept way?
  Howard.
  
14

name99 on November 17, 2021 at 11:08 pm

“These sub-maximal core-intensive tasks usually take the same time to complete regardless of whether the cores are busy, or almost entirely idle. ”

I do not know how to parse this. I don’t know if you’re just working with a different mental model.
If a core is running code it is busy. If it’s not running code it is idle. There’s no “almost idle but still running code”.

Likewise “. One simple observation which I find fascinating is that, when running tight code loops which only access registers, at high QoS, given processes don’t run as fast as they could to completion”
How do you know how much time they should take, to say they don’t run as fast as they should?

I can’t comment further because I don’t know your methodology. I get the feeling that you are doing something like
“run benchmark code on one core
run other stuff on other cores
see what happens”
and then are surprised by the result, but with no details beyond that, I’ve no idea whether the surprise is warranted or not.

My experience has been that if I lock code to a P-core and the code behaves like P-core-appropriate code (ie it’s a tight benchmark, not spending most of its time waiting on IO or sleeping or suchlike) it will stay locked to the P-core and behave as I expect.

But for code like compression, well, what’s the investigatory goal?
To analyze the P-cores, you could create a block of data to be compressed in a GB or two of DRAM, then run the compression code over it a few times. Pure CPU (and DRAM…) limited, easy to understand, easy to instrument with Performance Counters.
But if you’re talking about a full compression app, with lots of IO and some UI, then is it unreasonable, or even unexpected, for the OS to schedule much or all of the work on E-cores?

This is not an unreasonable thing to do, but it’s a different type of investigation, with different types of conclusions, and does not tell us much about the CPUs.

eg something like “For example, it’s assumed I think that each E core is across the board around half a P core. In which case, you’d expect NEON code to run at about half the speed on an E core as on a P core, and similarly for floating point and integer. ” is a terrible starting point for analyzing anything.

Generically an E core is more like 1/3 of a P core. (More like 22% for FP, see the summary graphs here:
https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/3)

In terms of FP “direct” hardware
– an E core has 2 FP units vs 4 for a P core. So 50% right?
– but an E core has max GHz of about 2/3 a P core. So we’re down to 33% right?
– but an E core has much less of all that prediction and OoO machinery that allows a P core to go fast on unstructured code. If you write a certain type of loop, most of that doesn’t matter, and you can get 33% FP performance out of your E core. A different type of loop really needs that OoO machinery and does much worse.
It could even go the other way — a particularly terrible (and mostly very unrealistic) type of loop that is completely dominated by random DRAM access could possibly run at much the same speed on either an E or a P-core!

– finally we have the OS which, if it concludes your app spends most of its time waiting for IO, will likely give you an E core running not at 2GHz but at 1GHz or lower.

So when one sees that an E core is running code much slower than a P-core, to conclude that it’s unexpected
– one needs to get the expected performance ratio correct (substantially dependent on the code, but for “a reasonable variety” of code, as opposed to a particular tight loop, 30% is a reasonable expectation for int, 20% a reasonable expectation for FP).
– and one has to have an idea of whether the OS considers this to even be performance-relevant code (and so is running the E core at max GHz) as opposed to “mostly waiting around” type code to be run at low GHz.
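The “direct” hardware part of this reasoning can be put into a few lines of Python. This is a rough sketch only: the function name is made up, the 3GHz P-core clock is inferred from the “about 2/3” and “2GHz” figures above rather than measured, and the result is an upper bound that ignores the prediction and out-of-order differences just described.

```python
def expected_fp_ratio(e_units, p_units, e_ghz, p_ghz):
    """Upper bound on E-core FP throughput relative to a P core,
    from FP unit count and clock frequency alone."""
    return (e_units / p_units) * (e_ghz / p_ghz)


at_max_clock = expected_fp_ratio(2, 4, e_ghz=2.0, p_ghz=3.0)   # E core at full speed
at_low_clock = expected_fp_ratio(2, 4, e_ghz=1.0, p_ghz=3.0)   # E core held at 1GHz
print(f"{at_max_clock:.0%} at max clock, {at_low_clock:.0%} at 1GHz")
```

That gives 33% with the E core at its maximum frequency and 17% when it is held at 1GHz, before any allowance for the kind of code being run.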

Perhaps a different way to look at this is to consider, for different types of code, when the OS makes DVFS decisions.
In terms of “what does a naive user see”, you could compare code that is all marked at the same QoS level, but which either runs a single long computation over many seconds, or breaks up the computation frequently with calls like
– deliberate sleep() or
– IO (network, file) or
– user interaction or
– waiting on another app.

For example if a thread makes one IO call, does that immediately ghetto’ize the thread to an E core at low-frequency? And if so, for how long? If not, is there a threshold? At some density of IO the OS must surely conclude that this thread is not appropriate for a P core!
If that IO call (and a transition to E) is followed by ten seconds of pure compute, can one see the point at which the thread is moved back to a P core?
PerfMon might still not be the best tool for this compared to making API calls within the app (may not have fine enough timing resolution), but it seems like the sort of fun project that would build on your investigations so far, and whose results would be very interesting to report!

And don’t feel bad! We’re all amateurs in this, just with different expertise. I’m comfortable with the low-level CPU details, but rely on people like you to help me with, or explain to me, an endless sequence of OS-level issues like some security nonsense, or backups. We’re all ignoramuses about 95% of everything :-(

- 15
  
  hoakley on November 18, 2021 at 7:00 am
  
  Thank you, Maynard. I’m really sorry, but without dropping everything else I don’t currently have time to respond to all the questions/issues you raise, but I will get back to you when I do have time. I greatly appreciate your thoughts, and apologise for being so thick. At least you find a few of my articles of limited use.
  Howard.
  