Evaluating M3 Pro CPU cores: 1 General performance

Evaluating the performance of CPUs with identical cores is relatively straightforward, and they’re easy to compare using single- and multi-core benchmarks. When there are two different types of core, one designed primarily for energy efficiency (E), the other for maximum performance (P), traditional benchmarks can readily mislead. Multi-core results are dominated by the ratio of P to E cores, and variable core frequencies confound comparisons further. In this series of articles, I set out to disentangle these factors when comparing core performance between Apple’s original M1 Pro and its third-generation M3 Pro chips.

This first article explains why and how I am investigating this, and shows overall results for performance and power use under a range of loads.

Why?

Many different factors determine CPU performance, and traditional benchmarks normally try to look at them all simultaneously across a range of computing tasks, drawn from those that might be encountered in ‘normal’ use. Those factors range from the number of core cycles to execute instructions, through core frequency, to cache and main memory access. The tests I use here focus on the core itself, and how rapidly it can execute a tight loop of code that requires no cache or other memory access. Execution rate should therefore be determined by core design, which type of core that thread is run on, and control over the core’s frequency.

Methods

I use a GUI app wrapped around a series of loading tests designed to enable a CPU core to execute their code as fast as possible, and with as few extraneous influences as possible. Of the four tests reported here, three are written in assembly code, and the fourth calls a highly optimised function in Apple’s Accelerate library from a minimal Swift wrapper. These tests aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run; they simply give the core the opportunity to demonstrate how fast it can execute code at a given frequency, and reveal how macOS manages core types and determines core frequencies. Without understanding at this level, interpreting other benchmarks becomes impossible.

The four tests used here are:

  • 64-bit integer arithmetic, including a MADD instruction to multiply and add, a SUBS to subtract, an SDIV to divide, and an ADD;
  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic, including FMUL, two FADDP and a FADD instruction;
  • simd_float4 calculation of the dot-product using simd_dot in the Accelerate library.

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at a set Quality of Service (QoS). Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.
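As an illustration of that approach, here’s a minimal sketch in Swift, not the app’s actual code: it dispatches a chosen number of threads to a single concurrent queue at a set QoS, and times each using Mach Absolute Time. The runLoops function is just a stand-in for the assembly loop tests given in the Appendix, and the names and counts used here are arbitrary.

import Foundation

// Stand-in for one of the loading tests; the real app calls the assembly routines in the Appendix.
func runLoops(_ count: Int) -> Double {
    var a = 1.2345
    for _ in 0..<count { a = (a * 1.0001 + 1.0) - 1.0 }
    return a
}

// Dispatch the given number of threads to one concurrent queue at the set QoS,
// and report the time taken by each, measured with Mach Absolute Time.
func runTests(threads: Int, loops: Int, qos: DispatchQoS) {
    var timebase = mach_timebase_info_data_t()
    mach_timebase_info(&timebase)
    let ticksToSeconds = Double(timebase.numer) / Double(timebase.denom) / 1e9

    let queue = DispatchQueue(label: "looptests", qos: qos, attributes: .concurrent)
    let group = DispatchGroup()
    for thread in 0..<threads {
        queue.async(group: group) {
            let start = mach_absolute_time()
            _ = runLoops(loops)
            let elapsed = Double(mach_absolute_time() - start) * ticksToSeconds
            print("Thread \(thread): \(elapsed) s")
        }
    }
    group.wait()    // return only when every thread has completed
}

// .background threads are confined to E cores; .userInteractive threads prefer P cores.
runTests(threads: 4, loops: 100_000_000, qos: .userInteractive)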

I normally run tests at either the minimum QoS of 9, or the maximum of 33. The former are constrained by macOS to be run only on E cores, while the latter are run preferentially on P cores, but may be run on E cores when no P core is available. All tests are run with a minimum of other activities on that Mac, although it’s not unusual to see small amounts of background activity on the E cores during test runs.

In addition to the times required to complete execution of each thread, most tests are also run during a period in which powermetrics is collecting measurements from the CPU cores. Those are collected over sampling periods of 0.1 second, typically for 5 seconds in total.
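As an assumed example of such a collection, the sketch below launches powermetrics from Swift with its cpu_power sampler at 100 ms intervals for 50 samples, about 5 seconds in total. Those options are my reading of its man page rather than the exact invocation used for these tests, and it has to be run with root privileges, so check man powermetrics before relying on it.

import Foundation

// Assumed example: run powermetrics' cpu_power sampler at 100 ms intervals for 50 samples.
// This must be run as root.
let powermetrics = Process()
powermetrics.executableURL = URL(fileURLWithPath: "/usr/bin/powermetrics")
powermetrics.arguments = ["--samplers", "cpu_power", "-i", "100", "-n", "50"]

let pipe = Pipe()
powermetrics.standardOutput = pipe

do {
    try powermetrics.run()
    // Read the output before waiting, to avoid blocking on a full pipe.
    let data = pipe.fileHandleForReading.readDataToEndOfFile()
    powermetrics.waitUntilExit()
    print(String(data: data, encoding: .utf8) ?? "")
} catch {
    print("Failed to run powermetrics: \(error)")
}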

powermetrics returns three measurements used throughout these articles:

  • Core frequency is given as an average over the collection period. As this is set by macOS for each cluster, frequencies of all cores within any given cluster are the same, even though some may not be actively processing instructions.
  • Active residency is given for individual cores, and may vary widely between cores in any cluster. This is the percentage of time that core isn’t idle, but is actively processing instructions. In places, I total those individual core values to give a cluster total active residency; for a cluster of six cores, its maximum will thus be 600%. This is the basis for CPU measurements shown in Activity Monitor, which don’t take core frequency into account, as illustrated in the sketch after this list.
  • CPU power is an estimate of the average total power used by all the CPU cores together, over the sampling period.
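To illustrate that last point, here’s a small sketch of my own, not anything in powermetrics or Activity Monitor, that weights active residency by cluster frequency. A core that’s 100% active at 744 MHz is doing far less work than one that’s 100% active at its maximum of 2748 MHz, yet both appear as 100% CPU in Activity Monitor; the figures used are examples from this article.

// Weight a core's active residency by its cluster frequency, normalised to the
// core's maximum frequency.
func weightedLoad(activeResidency: Double, frequency: Double, maxFrequency: Double) -> Double {
    return activeResidency * frequency / maxFrequency
}

let lowQoS = weightedLoad(activeResidency: 100, frequency: 744, maxFrequency: 2748)
let highQoS = weightedLoad(activeResidency: 100, frequency: 2748, maxFrequency: 2748)
print(lowQoS, highQoS)    // about 27 and 100, although both show as 100% in Activity Monitor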

Example results

Because each test thread is a single series of tight loops, each is normally executed on a single CPU core, although some are relocated from one core to another during execution. The following presents the detailed powermetrics results for a single test run, here of four floating point test threads, each consisting of 200 million loops, run at high QoS (33) on a MacBook Pro 16-inch M3 Pro.

m3float4pactres

This chart shows active residency of cores during the four-thread test, with individual P cores shown in black, and the total for the whole E cluster shown in red. The cores were loaded with four threads shortly after 200 ms, and the test was completed before 1400 ms. Test threads were run only on P cores numbered 6, 7, 9 and 10, whose active residencies followed the almost identical solid black lines. P8 showed a brief period of high active residency as the threads loaded, as did cores in the E cluster. During the test, those four P cores remained at 100% active residency, so each thread accounted for all the activity on a single P core.

m3float4pfreqpower

This chart shows core frequencies and total CPU power used during the same test, with four threads, each running on a single P core throughout. The frequency of the six P cores rose rapidly as the threads were loaded, and fell when they were complete. The steady maximum P core frequency here was 3624 MHz. E core frequency (red) changed little during this test, with small peaks during thread loading and unloading. Total CPU power use is shown in purple (with open diamond points), and follows P core frequency and active residency, with a plateau of around 3660 mW.

Measured times to run each loading thread were 1.05 seconds, which matches the interval seen between loading and completion here, with all four threads being run concurrently.
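Loop throughput per thread, used in the comparisons that follow, is simply the number of loops completed divided by the time taken. Applying that arithmetic to this example run:

// Throughput for the example above: 200 million loops per thread in 1.05 seconds.
let loopsPerThread = 200_000_000.0
let timeTaken = 1.05
let throughputPerThread = loopsPerThread / timeTaken    // ≈ 1.9 × 10⁸ loops/s
let totalThroughput = throughputPerThread * 4           // four threads run concurrently
print(throughputPerThread, totalThroughput)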

Single core performance

The simplest comparison that can be made between M1 Pro and M3 Pro CPU cores is that between single-thread (hence, single-core) loop throughput for each of the four tests. These are shown first for P cores, then for E.

m1m3testspbarcomp

Relative to the throughput measured on a P core in the M1 Pro, P cores in the M3 Pro delivered 130% (integer), 128% (floating point), 167% (NEON) and 163% (Accelerate) throughput. At first sight, the first two look like modest improvements that could be attributed to the difference in core frequency: M3 Pro P cores have a maximum frequency of 4056 MHz, 126% of the 3228 MHz maximum in the M1 Pro. However, while the M1 Pro’s P core was running at its maximum frequency during these tests, the M3 Pro’s was running at only 3624 MHz, 112% that of the M1. Even the integer and floating point loads therefore ran faster than would be expected on the basis of core frequency alone.

That difference between M1 and M3 is even greater for the NEON and Accelerate tests, where the M3 Pro performs substantially better than the M1 Pro even when allowing for the greatest possible difference in their frequency.
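One way to separate core design from frequency is to divide each throughput ratio by the ratio of the frequencies actually seen during the tests, 3624/3228 or about 1.12. The sketch below is simply that arithmetic applied to the figures above:

// Per-clock improvement of M3 Pro P cores relative to M1 Pro P cores,
// using the frequencies observed during these tests.
let frequencyRatio = 3624.0 / 3228.0    // ≈ 1.12
let throughputRatios = [("integer", 1.30), ("floating point", 1.28), ("NEON", 1.67), ("Accelerate", 1.63)]
for (test, ratio) in throughputRatios {
    print(test, ratio / frequencyRatio)
}
// integer ≈ 1.16, floating point ≈ 1.14, NEON ≈ 1.49, Accelerate ≈ 1.45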

m1m3testsebarcomp

E cores in the M3 Pro also have a higher maximum frequency (2748 MHz) than those in the M1 Pro (2064 MHz), but in practice low QoS threads run significantly slower on the M3, because it runs them at only 744 MHz, whereas the M1 Pro runs them at 972 MHz. On the basis of frequency alone (744/972), M3 E core throughput would be expected to be about 77% that of M1 E cores, which it is for floating point and NEON (both 78%), but the M3’s E cores fall slightly short in the Accelerate (71%) and integer (69%) tests.

Thus, on single core performance, the M3’s P core delivers more than you’d expect on frequency alone, particularly for vector computation, but its E cores run more slowly for low QoS threads.

Multi-core performance

It’s time to look at how the M3 runs on more cores, and how P and E cores work together. For this, I collate the results from running just one type of test thread, floating point arithmetic, in different numbers and at high and low QoS.

m1m3pethroughput

This chart shows loop throughput per thread (core) attained by one and more cores on the two chips.

Starting with the E cores alone, and low QoS, shown in the lower pair of lines, the M1 Pro (red) and M3 Pro (black) are completely different. The M1 Pro pulls a trick here: although a single thread running on one E core delivers a throughput slightly greater than that of one thread on one E core on an M3 Pro, there’s a much larger difference when running two threads. That’s because macOS increases the frequency of the two-core E cluster in the M1 Pro from 972 MHz to close to its maximum of 2064 MHz. This appears intended to compensate for the small size of the cluster.

When running three or more threads, the M1 Pro runs out of E cores, and the additional threads have to be queued until one of those E cores becomes available again, so its per-thread throughput falls. With its six E cores, the M3 Pro plods on more slowly, but doesn’t have to start queueing threads, with the resulting fall in throughput, until the seventh.

The P cores, in the upper pair of lines, are more similar. Throughput remains linear in the eight P cores of the M1 Pro (red), up to its total of eight threads. Although the M3 Pro (black) has only six P cores, because these threads are run at high QoS, any excess can readily be accommodated on free E cores, whose frequency is then increased to around 2748 MHz. This does lead to a slow decline in throughput with 7-10 threads, but even when running threads on all six of its P cores and four of its E cores, the M3 Pro achieves a per-thread throughput slightly higher than that of a single thread on an M1 Pro P core.

Power

Because power use is so different between E and P cores, I’ll consider E cores alone to start with.

m1m3epower

This chart shows total CPU power used by different numbers of threads running on the E cores of the M1 Pro (red) and M3 Pro (black). Although they start at an almost identical value for one thread (one core), they rapidly diverge, with the M3 remaining below the power used by two or more threads on an M1 Pro, even when it’s running 6-8 threads. This difference is substantial: at two threads, it amounts to about 150 mW, and at four it’s still over 100 mW.

In comparison, the P cores use 20-25 times as much power as the E cores.

m1m3pepower

Total CPU power use is shown here for both P and E core loads on the M1 Pro and M3 Pro. The lower pair of lines shows those for E cores alone, from the previous chart, and the upper pair shows those for high QoS loads running on P and, when needed, E cores, for the M1 Pro (red) and M3 Pro (black). The line for the M1 Pro is linear up to its total of eight P cores, and points for the M3 Pro lie close to it up to its cluster size of six P cores. These put the power cost of each additional P core at about 935 mW, for either chip. Above six threads, though, recruitment of E cores in the M3 Pro improves power efficiency. As the M1 Pro has only two E cores, overflowing threads from P to E cores isn’t such a good idea there, because of its potential impact on low QoS threads that are confined to running on those E cores.

Conclusions

  • There are substantial differences in performance and efficiency between the CPU cores of M1 Pro and M3 Pro chips.
  • P cores in the M3 Pro consistently deliver better performance than those in the M1 Pro. Gains are greater than would be expected from differences in frequency alone, and are greatest in vector processing, where throughput in the M3 Pro can exceed 160% of that in the M1 Pro. These gains are achieved with little difference in power use.
  • E cores in the M3 Pro run significantly slower with background, low QoS threads, but use far less power as a result. When running high QoS threads that have overflowed from P cores, they deliver reasonably good performance relative to P cores, but remain efficient in their power use.
  • M3 Pro CPU cores are both more performant and more efficient than those in the M1 Pro.

Appendix: Source code

_intmadd:
    STR LR, [SP, #-16]!    // preserve the link register on the stack
    MOV X4, X0             // X4 = loop count passed in X0
    ADD X4, X4, #1         // add 1 so the loop body runs exactly X0 times
int_while_loop:
    SUBS X4, X4, #1        // decrement the counter and set flags
    B.EQ int_while_done    // exit when the counter reaches zero
    MADD X0, X1, X2, X3    // X0 = (X1 × X2) + X3
    SUBS X0, X0, X3        // X0 = X0 − X3
    SDIV X1, X0, X2        // X1 = X0 ÷ X2
    ADD X1, X1, #1         // X1 = X1 + 1
    B int_while_loop
int_while_done:
    MOV X0, X1             // return the result in X0
    LDR LR, [SP], #16      // restore the link register
    RET

_fpfmadd:
    STR LR, [SP, #-16]!    // preserve the link register on the stack
    MOV X4, X0             // X4 = loop count passed in X0
    ADD X4, X4, #1         // add 1 so the loop body runs exactly X0 times
    FMOV D4, D0            // copy the three double arguments into working registers
    FMOV D5, D1
    FMOV D6, D2
    LDR D7, INC_DOUBLE     // load the double increment constant defined elsewhere
fp_while_loop:
    SUBS X4, X4, #1        // decrement the counter and set flags
    B.EQ fp_while_done     // exit when the counter reaches zero
    FMADD D0, D4, D5, D6   // D0 = (D4 × D5) + D6
    FSUB D0, D0, D6        // D0 = D0 − D6
    FDIV D4, D0, D5        // D4 = D0 ÷ D5
    FADD D4, D4, D7        // D4 = D4 + increment
    B fp_while_loop
fp_while_done:
    FMOV D0, D4            // return the result in D0
    LDR LR, [SP], #16      // restore the link register
    RET

_neondotprod:
    STR LR, [SP, #-16]!        // preserve the link register on the stack
    LDP Q2, Q3, [X0]           // load the two 4-lane float vectors passed by address in X0
    FADD V4.4S, V2.4S, V2.4S   // V4 = V2 + V2, used to increment V2 each loop
    MOV X4, X1                 // X4 = loop count passed in X1
    ADD X4, X4, #1             // add 1 so the loop body runs exactly X1 times
dp_while_loop:
    SUBS X4, X4, #1            // decrement the counter and set flags
    B.EQ dp_while_done         // exit when the counter reaches zero
    FMUL V1.4S, V2.4S, V3.4S   // multiply the four lanes of V2 and V3
    FADDP V0.4S, V1.4S, V1.4S  // pairwise add to start the horizontal sum
    FADDP V0.4S, V0.4S, V0.4S  // second pairwise add completes the dot product
    FADD V2.4S, V2.4S, V4.4S   // increment V2 for the next iteration
    B dp_while_loop
dp_while_done:
    FMOV S0, S2                // return the first lane of V2 in S0
    LDR LR, [SP], #16          // restore the link register
    RET

import simd    // simd_float4 and simd_dot come from the simd module

// Accelerate/simd test: accumulate the dot product of two 4-lane float vectors,
// incrementing one of them each time round the loop.
func runAccTest(theA: Float, theB: Float, theReps: Int) -> Float {
    var tempA: Float = theA
    var vA = simd_float4(theA, theA, theA, theA)
    let vB = simd_float4(theB, theB, theB, theB)
    let vC = vA + vA
    for _ in 1...theReps {
        tempA += simd_dot(vA, vB)
        vA = vA + vC
    }
    return tempA
}