Over the 40 years since Steve Jobs launched the Macintosh 128K on 24 January 1984, Macs, like all computers, have continuously improved their performance, coming to process data ever faster. This article looks at some of the techniques that have been used to accelerate the CPUs of Macs over those years, and how those techniques have changed.
Faster
CPUs execute instructions in synchrony with a clock whose frequency determines the rate of instruction execution. The Motorola 68000 processor in that Mac 128K ambled along at a clock speed of just 8 MHz. By 2006, the first Intel processor used by a production desktop Mac ran at a frequency of 1.83 GHz, over 200 times as fast. By 2007, the eight cores in my first Mac Pro had reached 3.0 GHz, but in 2022 the Performance cores in my Studio M1 Max topped out at 3.2 GHz, just over 400 times as fast as the first Mac.
These changes in frequency are shown in the two charts below.

This chart uses a conventional linear Y axis to demonstrate that frequency rose rapidly during the decade from 1997. As the form of this curve is S-shaped, the chart below shows the same data with a logarithmic Y axis.

Since about 2007, Macs haven’t seen substantial frequency increases. Many factors limit the maximum frequency that a processor can run at, including its physical dimensions, but among the most significant in practical terms are its power requirements and heat output, hence its need for cooling. Some of the last Power Mac G5 models ran their dual processors at 2.5 to 2.7 GHz, could use a steady 600 W of power, and had to be liquid-cooled. Most died early when their coolant started to leak.
Two ways to beat that limit on frequency are multiple cores and processing more data at once.
More cores
Adding more processor cores has been an effective way to run more code at the same time. Tasks are divided into threads that can run relatively independently of one another. Those threads can then be distributed across several CPU cores. My 8-core Mac Pro of 2007 blossomed into the 2019 Mac Pro that could have as many as 28 cores running at 4.0 or 4.4 GHz and drawing up to 900 W. In contrast, the current Mac Studio M2 Ultra has 24 cores but requires less than a third of that power.
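To sketch how that division of work looks in Swift, the short fragment below (an illustration of mine, not code from any of the tests described here) uses Grand Central Dispatch to split a simple task into eight independent chunks and leaves the system to schedule them across the cores available:
import Dispatch

// Illustration only: square a million Floats in eight independent chunks.
// DispatchQueue.concurrentPerform runs its closure once per iteration, and
// spreads those iterations across the CPU cores that are available.
let input = [Float](repeating: 1.234, count: 1_000_000)
var output = [Float](repeating: 0.0, count: 1_000_000)
let chunks = 8
let chunkSize = input.count / chunks

output.withUnsafeMutableBufferPointer { buffer in
    let out = buffer    // local copy, so the concurrent closure doesn't capture the inout parameter
    DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
        // Each chunk writes to its own disjoint range of the output, so no locking is needed.
        let start = chunk * chunkSize
        for i in start..<(start + chunkSize) {
            out[i] = input[i] * input[i]
        }
    }
}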

This chart shows how the number of processors and cores inside Macs didn’t start rising until around 2005, just as frequencies were topping out. Thus, many of the CPU performance improvements from 2007 onwards have been the result of providing more cores. But there’s a practical limit as to how many of those cores will get used, which is where processing more data becomes important.
More data
Threads are normally relatively large chunks of code. Single instruction, multiple data (SIMD) works at the other end of the scale, and with a little ingenuity can deliver the greatest speed improvements with little additional power or heat load.
The way this works is deceptively simple. As an example, I’ll take a section of code that has to multiply floating-point numbers. To perform that once, two registers in the CPU core’s floating-point unit are loaded with the numbers, the instruction multiplies them and leaves the result in another register. That’s fine when you’ve only got to do that once, but what happens when you need to do it hundreds or thousands of times?
With SIMD, registers are packed with more than one number at a time, and the multiply instruction works on them all at the same time. This requires larger registers, and numeric formats using fewer bits. If the registers are 128 bits wide, they can accommodate two 64-bit double-precision floating-point numbers at once; with 32-bit single-precision numbers they can work on four at a time, and with 16-bit float16 or bfloat16 numbers, they can be multiplied in batches of eight, four times faster than 64-bit.
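Swift’s built-in SIMD types show what that packing looks like in code; this is just an illustration of the principle, not the code used in the tests below:
// Scalar: one multiplication at a time.
let x: Float = 1.5
let y: Float = 2.0
print(x * y)        // 3.0

// SIMD: a SIMD4<Float> packs four 32-bit floats into one 128-bit value, and a
// single * multiplies all four pairs element-wise at once. SIMD4 is part of
// the Swift standard library, so no import is needed.
let a = SIMD4<Float>(1.0, 2.0, 3.0, 4.0)
let b = SIMD4<Float>(5.0, 6.0, 7.0, 8.0)
print(a * b)        // SIMD4<Float>(5.0, 12.0, 21.0, 32.0)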
SIMD isn’t new by any means, and first came to PCs in 1996 in Intel CPUs. Ironically, one of the finest implementations was in PowerPC processors, in their AltiVec system. The biggest difficulty is in writing code that can make the best of its potential. Efforts have been made for compilers to identify and convert conventional code so that it uses SIMD, and for languages to gain extensions that facilitate it. Apple currently supports SIMD and related techniques in its extensive Accelerate framework and related libraries, which make the best use of hardware support in both Intel and Apple silicon chips.
To demonstrate how effective these libraries can be, I’ve tested my iMac Pro, with its 8-core 3.2 GHz Intel Xeon CPU, and the Performance cores in an M3 chip, performing the same multiplication of two 16 x 16 single-precision (32-bit) floating-point matrices both in conventional code and by calling an Accelerate function.
Using conventional code on the Intel CPU, a single thread ran 62,800 multiplications per second; using Accelerate that rose to 4,100,000, 65 times faster. On the M3, conventional code ran 109,000 multiplications per second, and Accelerate boosted that to 5,500,000, 50 times faster. Compared to the gains achieved by relatively small increases in core frequency, or running on several cores, SIMD can have huge benefits.
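If you want to repeat this sort of measurement, a minimal timing harness could look like the sketch below. It assumes that each loop given in the Appendix has been wrapped in a function taking the repetition count and returning the number of loops it completed; that arrangement and the function’s shape are my assumptions, for illustration only:
import Dispatch

// Time a benchmark closure and return the number of loops completed per second.
func measureRate(reps: Int, _ benchmark: (Int) -> Float) -> Double {
    let start = DispatchTime.now()
    let completed = benchmark(reps)
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
    return Double(completed) / seconds
}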
Among the major problems with the SIMD approach are that not all time-consuming code is suitable for this treatment, and that some still has to be run conventionally. In other situations, the bottleneck may not be in the CPU core at all, as it may spend many of its cycles waiting on memory. Most of all, though, the coder must identify and use appropriate functions in the Accelerate library, rather than writing their own code. The Appendix below gives you an idea of how different the code is for the matrix multiplications I used for testing.
Speed of execution isn’t the only reason for using SIMD. Although I haven’t measured the power used by Intel CPUs, there’s a substantial difference in the cores of an M3 chip: when running a single thread of conventional code, one Performance core used 6.5 W, but the Accelerate function used only 5.5 W. As energy is power multiplied by time, and the conventional code took 50 times as long for the same number of multiplications, that conventional code costs about 60 times as much energy for the same task as the Accelerate function (6.5 W × 50 ÷ 5.5 W ≈ 59). That would make a big difference to battery endurance, and to the need for cooling.
Timeline
This has been a greatly simplified overview, and there have been a great many other changes in CPUs over those 40 years, but those eras span:
- 1984-2007: increasing CPU frequency
- 2005-2017: increasing CPU core count
- 1998 onwards: increasing data throughput with SIMD.
Appendix: Source Code
Classical Swift matrix multiplication of 16 x 16 32-bit floating point matrices
// theReps, the total number of repetitions requested, is supplied by the
// enclosing test function; the classical loop runs it divided by 1,000.
var theCount: Float = 0.0
let theMReps = theReps/1000
let rows = 16
// A and B are the 16 x 16 input matrices and C the result, as nested arrays of Float.
let A: [[Float]] = Array(repeating: Array(repeating: 1.234, count: 16), count: 16)
let B: [[Float]] = Array(repeating: Array(repeating: 1.234, count: 16), count: 16)
var C: [[Float]] = Array(repeating: Array(repeating: 0.0, count: 16), count: 16)
for _ in 1...theMReps {
    // Classical row-by-column multiplication as a triple nested loop.
    for i in 0..<rows {
        for j in 0..<rows {
            for k in 0..<rows {
                C[i][j] += A[i][k] * B[k][j]
            }
        }
    }
    theCount += 1
}
return theCount
In the ‘classical’ CPU implementation shown above, matrices A, B and C are each 16 x 16 Floats for simplicity, and the triple nested loop is repeated theMReps times for the test.
16 x 16 32-bit floating point matrix multiplication using vDSP_mmul()
import Accelerate

// theReps, the number of repetitions, is supplied by the enclosing test function.
var theCount: Float = 0.0
// A and B are the 16 x 16 input matrices and C the result, each stored as a flat
// array of 256 Floats in row-major order, with a stride of 1 between elements.
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
// M, N and P are the matrix dimensions: A is M x P, B is P x N, and C is M x N.
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
    B.withUnsafeBufferPointer { Bptr in
        C.withUnsafeMutableBufferPointer { Cptr in
            for _ in 1...theReps {
                vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
                theCount += 1
            }
        }
    }
}
return theCount
Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision”, adding: “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”
