Apple silicon chips provide several options for high-performance vector and matrix computation, including the NEON unit in each CPU core, which I examined in the last article in this series, the GPU and neural engine (ANE), and Apple’s own matrix coprocessor, the AMX. For most of these, Apple provides libraries that deliver the best performance appropriate to the hardware platform they’re run on. Apple doesn’t, as a rule, document which of those units any given library function will use.
AMX is intentionally undocumented, and much about its implementation in Apple silicon chips remains shrouded in mystery. Research has documented its instruction set for M1 and M2 family chips, but precious little is known about the AMX coprocessor(s) in the M3 family. I’m very grateful to Maynard Handley, who suggested that I incorporate two test routines for use on the M3 Pro to try to elicit information about its AMX performance.
This article relies on explanations of the methods used in previous articles in this series.
If you’re not already familiar with the first of those, I recommend that you read it before this article, or you may well be mystified.
Tests
Results reported here come from two new tests that I have incorporated into my GUI wrapper app, both drawn from Apple’s Accelerate library, one using vDSP and the other its Sparse Solvers. The code, given in full in the Appendix at the end, has been shamelessly stolen from Apple’s documentation.
The vDSP test performs a forward fast Fourier transform and an inverse transform on eight real elements, using two calls to vDSP_DFT_Execute(). The original example is given here. The Sparse Solver test performs a sparse Cholesky factorization, then uses that to solve its system of equations with SparseSolve(). Its original example is given here.
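Although the wrapper app itself isn’t shown here, the following minimal sketch suggests one way that a test function like these could be run on a chosen number of threads to measure loop throughput. It isn’t the code used for these results: it relies on DispatchQueue.concurrentPerform, and runConcurrently and its parameters are illustrative names only.

import Foundation

// Run one test function on a given number of concurrent threads, and return
// the overall rate of test loops completed per second.
func runConcurrently(threads: Int, reps: Int, test: (Int) -> Float) -> Double {
    let start = Date()
    // concurrentPerform blocks until all iterations have completed.
    // Threads are scheduled by macOS; QoS and core type aren't controlled here.
    DispatchQueue.concurrentPerform(iterations: threads) { _ in
        _ = test(reps)
    }
    let elapsed = Date().timeIntervalSince(start)
    return Double(threads) * Double(reps) / elapsed
}

// For example, four threads each running 100,000 loops of the vDSP FFT test:
// let loopRate = runConcurrently(threads: 4, reps: 100_000, test: runvDSPFFT)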
vDSP FFT
The relationship between the rate of executing vDSP FFT test loops and the number of threads is quite unlike any other that I’ve seen during testing. For 1-4 threads, throughput rises linearly with thread count, as seen generally, but it then falls when going from four to five threads (cores). From five threads upwards, the relationship is again linear, but with a shallower gradient than with fewer threads.

This is shown clearly in the chart above, where results for the P cores in an M1 Max are shown in blue, and those from the P cores in an M3 Pro are in red. For both line segments, the gradient attained by the M3 is significantly steeper than that for the M1: at low thread counts, M3 Pro throughput was 135% that of the M1 Max, and the ratio of their gradients is almost the same. At higher thread counts, however, the M3 Pro achieved over 170% of the throughput of the M1 Max.
The reason for this fall in throughput when going from 4 to 5 threads isn’t clear, bearing in mind that the M1 has four-core clusters but the M3’s clusters contain six cores. Looking in detail at powermetrics measurements for the M3 Pro, there was a small reduction in P core frequency, from 3624 MHz with 1-4 threads to 3576 MHz with 5 threads, but that’s insufficient to account for the difference seen.

This chart of total active residency for the six-core P cluster demonstrates that each added thread contributed a single core’s worth of 100% active residency, and with five test threads the total remained at 500%, well within the maximum of 600% for the cluster.

There is something going on with these tests, though, as suggested by this chart of total CPU power used for 1-5 threads. Instead of being evenly spaced, the increments are quite irregular, and almost grouped in pairs. Note that powermetrics gives separate estimates for GPU and neural engine power, both of which remained close to zero throughout, but doesn’t make it clear whether AMX power use is included in its total CPU figures.
Whatever the reasons, the behaviour of the vDSP FFT test is very different from other tests that I have used, apart from the next.
Sparse solver
If the throughput of the vDSP FFT test looked incoherent, that from the Sparse Solver test appears almost random.

Once again, in this chart blue points and lines are those of the M1 Max, and red are from the M3 Pro. Here there’s a marked discontinuity between two and three threads. Above three threads, the M1 appears to decline steadily, while the M3 is far higher and relatively steady.
On the M3 Pro, core frequencies were lower at 3516 MHz with one and two threads, and rose to 3580-3590 MHz with three and four threads, another difference too small to account for the changes seen in throughput.

Active residency differed from that in the vDSP FFT test, as it was significantly greater than that of the test threads alone. With a single test thread of 100%, total active residency on the cluster of P cores was about 150%, and with four threads totalling 400%, overall active residency reached about 550% for the cluster. This would account for falling throughput with five threads or more, as they would exceed the 600% available from the P cores, but it doesn’t explain behaviour with four threads or fewer.

Total CPU power was also irregular, suggesting something else is going on, as with the vDSP FFT test.
Different tests
Because the AMX coprocessor can only be used through Apple’s libraries, and it’s unclear which of their functions actually perform their computation on the AMX, it’s not possible to conclude that these results tell us anything about the AMX in either of these chips.
Code run in these tests is considerably more complex than the tight loops of assembly language I have used previously. Both test functions involve substantial overhead in setting up variables, both before the test loop and within it. Even after moving as much as possible outside the loop, as sketched below, memory access from code inside the loop is inevitable.
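As a minimal sketch of what such hoisting might look like, taking just the forward transform of the FFT test as an example (this isn’t the code actually used above, and runHoistedFFT is an illustrative name), the DFT setup could be created once outside the loop, leaving only the transform, and its unavoidable reads and writes of the split-complex arrays, inside it:

import Accelerate

func runHoistedFFT(theReps: Int) -> Float {
    var theCount: Float = 0.0
    var complexReals: [Float] = [0, 2, 4, 6]
    var complexImaginaries: [Float] = [1, 3, 5, 7]
    // Create the forward setup once, and destroy it once when finished.
    guard let dft = vDSP_DFT_zrop_CreateSetup(nil, vDSP_Length(8), .FORWARD) else { return 0 }
    defer { vDSP_DFT_DestroySetup(dft) }
    for _ in 1...theReps {
        // Only the transform itself remains inside the timed loop.
        vDSP_DFT_Execute(dft, complexReals, complexImaginaries, &complexReals, &complexImaginaries)
        theCount += 1
    }
    return theCount
}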
In the Sparse Solver test, code incurred substantial processing outside the test thread, sufficient to require an additional 50-150% active residency, and to limit the six-core P cluster of the M3 Pro to running a total of only four test threads. While it’s possible that this additional processing is required to support code running on the AMX, more information is needed.
These two additional tests demonstrate how difficult it is to gain insights into core or coprocessor performance when running more complex code, and how useful tight code loops are by comparison.
Performance
If there’s one observation that shines through these clouds, it’s how performant the M3 Pro is when executing demanding tasks such as fast Fourier transforms and Cholesky decomposition. Relative to those of the M1 Max, throughputs of the M3 Pro attained 135% (FFT, low thread count), over 170% (FFT, high thread count), and 140% (Sparse Solver). Those can be attributed to the Accelerate library, the AMX coprocessor, improved core management and other factors, but teasing those apart requires much more work.
Conclusions
- Interpreting the results of more complex tests is far harder because there are too many unknown and uncontrolled variables.
- Throughput of vDSP FFT and Sparse Solver tasks shows complex relationships with the number of test threads.
- Both tests showed discontinuities in the relationship between throughput and thread count, although at different points: for the FFT, it occurred between 4 and 5 threads, whereas the Sparse Solver discontinuity was between 2 and 3 threads, in a six-core cluster.
- Relative to the M1 Max, the M3 Pro generally attained substantially higher throughput, ranging from 135% to over 170%.
- Although it’s impossible to identify the factors responsible, the AMX coprocessor may well have contributed to those differences.
Appendix: Source code
import Accelerate

func runvDSPFFT(theReps: Int) -> Float {
    let realValuesCount = 8
    var theCount: Float = 0.0
    // Eight real elements, packed as split-complex even/odd pairs.
    var complexReals: [Float] = [0, 2, 4, 6]
    var complexImaginaries: [Float] = [1, 3, 5, 7]
    for _ in 1...theReps {
        // Forward real-to-complex DFT.
        if let dft = vDSP_DFT_zrop_CreateSetup(nil, vDSP_Length(realValuesCount), .FORWARD) {
            vDSP_DFT_Execute(dft, complexReals, complexImaginaries, &complexReals, &complexImaginaries)
            vDSP_DFT_DestroySetup(dft)
        }
        // Scale by half, as the zrop forward transform's results are scaled by 2.
        vDSP.multiply(1 / 2, complexReals, result: &complexReals)
        vDSP.multiply(1 / 2, complexImaginaries, result: &complexImaginaries)
        // Inverse transform back to the original values.
        if let dft = vDSP_DFT_zrop_CreateSetup(nil, vDSP_Length(realValuesCount), .INVERSE) {
            vDSP_DFT_Execute(dft, complexReals, complexImaginaries, &complexReals, &complexImaginaries)
            vDSP_DFT_DestroySetup(dft)
        }
        // Scale by 1/N to normalise the inverse transform.
        vDSP.multiply(1 / Float(realValuesCount), complexReals, result: &complexReals)
        vDSP.multiply(1 / Float(realValuesCount), complexImaginaries, result: &complexImaginaries)
        theCount += 1
    }
    return theCount
}
func runSparseSolver(theReps: Int) -> Float {
    var theCount: Float = 0.0
    // The symmetric 4 x 4 matrix A is supplied in compressed, column-major
    // form, storing only its lower triangle.
    var columnStarts = [0, 3, 6, 7, 8]
    var rowIndices: [Int32] = [0, 1, 3, 1, 2, 3, 2, 3]
    var attributes = SparseAttributes_t()
    attributes.triangle = SparseLowerTriangle
    attributes.kind = SparseSymmetric
    let structure = SparseMatrixStructure(rowCount: 4, columnCount: 4, columnStarts: &columnStarts,
                                          rowIndices: &rowIndices, attributes: attributes, blockSize: 1)
    var values = [10.0, 1.0, 2.5, 12.0, -0.3, 1.1, 9.5, 6.0]
    var bValues = [2.20, 2.85, 2.79, 2.87]
    var xValues = [0.00, 0.00, 0.00, 0.00]
    for _ in 1...theReps {
        // Factorize A using Cholesky decomposition.
        let llt: SparseOpaqueFactorization_Double = values.withUnsafeMutableBufferPointer { valuesPtr in
            let A = SparseMatrix_Double(structure: structure, data: valuesPtr.baseAddress!)
            return SparseFactor(SparseFactorizationCholesky, A)
        }
        defer { SparseCleanup(llt) }
        // Solve Ax = b using the factorization, writing the result to xValues.
        bValues.withUnsafeMutableBufferPointer { bPtr in
            xValues.withUnsafeMutableBufferPointer { xPtr in
                let b = DenseVector_Double(count: 4, data: bPtr.baseAddress!)
                let x = DenseVector_Double(count: 4, data: xPtr.baseAddress!)
                SparseSolve(llt, b, x)
            }
        }
        theCount += 1
    }
    return theCount
}
