Over the 40 years since Steve Jobs launched the Macintosh 128K on 24 January 1984, Macs, like all computers, have continuously improved their performance, coming to process data ever faster. This article looks at some of the techniques that have been used to accelerate the CPUs of Macs over those years, and how those techniques have changed.
Faster
CPUs execute instructions in synchrony with a clock whose frequency determines the rate of instruction execution. The Motorola 68000 processor in that Mac 128K ambled along at a clock speed of just 8 MHz. By 2006, the first Intel processor used by a production desktop Mac ran at a frequency of 1.83 GHz, over 200 times as fast. By 2007, the eight cores in my first Mac Pro had reached 3.0 GHz, but in 2022 the Performance cores in my Studio M1 Max topped out at 3.2 GHz, just over 400 times as fast as the first Mac.
These changes in frequency are shown in the two charts below.

This chart uses a conventional linear Y axis to demonstrate that frequency rose rapidly during the decade from 1997. As the form of this curve is S-shaped, the chart below shows the same data with a logarithmic Y axis.

Since about 2007, Macs haven’t seen substantial frequency increases. Many factors limit the maximum frequency that a processor can run at, including its physical dimensions, but among the most significant in practical terms are its power requirements and heat output, hence its need for cooling. Some of the last Power Mac G5 models ran their dual processors at 2.5 to 2.7 GHz, could use a steady 600 W of power, and had to be liquid-cooled. Most died early when their coolant started to leak.
Two ways to beat that limit on frequency are multiple cores and processing more data at once.
More cores
Adding more processor cores has been an effective way to run more code at the same time. Tasks are divided into threads that can run relatively independently of one another. Those threads can then be distributed across several CPU cores. My 8-core Mac Pro of 2007 blossomed into the 2019 Mac Pro that could have as many as 28 cores running at 4.0 or 4.4 GHz and drawing up to 900 W. In contrast, the current Mac Studio M2 Ultra has 24 cores but requires less than a third of that power.
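To sketch how that division of work looks in Swift, the short fragment below (an illustration of mine, not code from any of the tests described here) uses Grand Central Dispatch to split a simple task into eight independent chunks and leaves the system to schedule them across the cores available:
import Dispatch

// Illustration only: square a million Floats in eight independent chunks.
// DispatchQueue.concurrentPerform runs its closure once per iteration, and
// spreads those iterations across the CPU cores that are available.
let input = [Float](repeating: 1.234, count: 1_000_000)
var output = [Float](repeating: 0.0, count: 1_000_000)
let chunks = 8
let chunkSize = input.count / chunks

output.withUnsafeMutableBufferPointer { buffer in
    let out = buffer    // local copy, so the concurrent closure doesn't capture the inout parameter
    DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
        // Each chunk writes to its own disjoint range of the output, so no locking is needed.
        let start = chunk * chunkSize
        for i in start..<(start + chunkSize) {
            out[i] = input[i] * input[i]
        }
    }
}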

This chart shows how the number of processors and cores inside Macs didn’t start rising until around 2005, just as frequencies were topping out. Thus, many of the CPU performance improvements from 2007 onwards have been the result of providing more cores. But there’s a practical limit as to how many of those cores will get used, which is where processing more data becomes important.
More data
Threads are normally relatively large chunks of code. Single instruction, multiple data (SIMD) works at the other end of the scale, and with a little ingenuity can deliver the greatest speed improvements with little additional power or heat load.
The way this works is deceptively simple. As an example, I’ll take a section of code that has to multiply floating-point numbers. To perform that once, two registers in the CPU core’s floating-point unit are loaded with the numbers, the instruction multiplies them and leaves the result in another register. That’s fine when you’ve only got to do that once, but what happens when you need to do it hundreds or thousands of times?
With SIMD, registers are packed with more than one number at a time, and the multiply instruction works on them all at the same time. This requires larger registers, and numeric formats using fewer bits. If the registers are 128 bits wide, they can accommodate two 64-bit double-precision floating-point numbers at once; with 32-bit single-precision numbers they can work on four at a time, and with 16-bit float16 or bfloat16 numbers, they can be multiplied in batches of eight, four times faster than 64-bit.
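Swift’s built-in SIMD types show what that packing looks like in code; this is just an illustration of the principle, not the code used in the tests below:
// Scalar: one multiplication at a time.
let x: Float = 1.5
let y: Float = 2.0
print(x * y)        // 3.0

// SIMD: a SIMD4<Float> packs four 32-bit floats into one 128-bit value, and a
// single * multiplies all four pairs element-wise at once. SIMD4 is part of
// the Swift standard library, so no import is needed.
let a = SIMD4<Float>(1.0, 2.0, 3.0, 4.0)
let b = SIMD4<Float>(5.0, 6.0, 7.0, 8.0)
print(a * b)        // SIMD4<Float>(5.0, 12.0, 21.0, 32.0)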
SIMD isn’t new by any means, and first came to PCs in 1996 in Intel CPUs. Ironically, one of the finest implementations was in PowerPC processors, in their AltiVec system. The biggest difficulty is in writing code that can make the best of its potential. Efforts have been made for compilers to identify and convert conventional code so that it uses SIMD, and for languages to gain extensions that facilitate it. Apple currently supports SIMD and related techniques in its extensive Accelerate framework and related libraries, which make the best use of hardware support in both Intel and Apple silicon chips.
To demonstrate how effective these libraries can be, I’ve tested my iMac Pro, with its 8-core 3.2 GHz Intel Xeon CPU, and the Performance cores in an M3 chip, performing the same multiplication of two 16 x 16 single-precision (32-bit) floating-point matrices both in conventional code and by calling an Accelerate function.
Using conventional code on the Intel CPU, a single thread ran 62,800 multiplications per second; using Accelerate that rose to 4,100,000, 65 times faster. On the M3, conventional code ran 109,000 multiplications per second, and Accelerate boosted that to 5,500,000, 50 times faster. Compared to the gains achieved by relatively small increases in core frequency, or running on several cores, SIMD can have huge benefits.
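If you want to repeat this sort of measurement, a minimal timing harness could look like the sketch below. It assumes that each loop given in the Appendix has been wrapped in a function taking the repetition count and returning the number of loops it completed; that arrangement and the function’s shape are my assumptions, for illustration only:
import Dispatch

// Time a benchmark closure and return the number of loops completed per second.
func measureRate(reps: Int, _ benchmark: (Int) -> Float) -> Double {
    let start = DispatchTime.now()
    let completed = benchmark(reps)
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
    return Double(completed) / seconds
}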
Among the major problems with the SIMD approach are that not all time-consuming code is suitable for this treatment, and that some still has to be run conventionally. In other situations, the bottleneck may not be in the CPU core at all, as it may spend many of its cycles waiting on memory. Most of all, though, the coder must identify and use appropriate functions in the Accelerate library, rather than writing their own code. The Appendix below gives you an idea of how different the code is for the matrix multiplications I used for testing.
Speed of execution isn’t the only reason for using SIMD. Although I haven’t measured the power used by Intel CPUs, there’s a substantial difference in the cores of an M3 chip: when running a single thread of conventional code, one Performance core used 6.5 W, but the Accelerate function used only 5.5 W. As energy is power multiplied by time, and the conventional code took 50 times as long for the same number of multiplications, that conventional code costs about 60 times as much energy for the same task as the Accelerate function (6.5 W × 50 ÷ 5.5 W ≈ 59). That would make a big difference to battery endurance, and to the need for cooling.
Timeline
This has been a greatly simplified overview, and there have been a great many other changes in CPUs over those 40 years, but those eras span:
- 1984-2007: increasing CPU frequency
- 2005-2017: increasing CPU core count
- 1998 onwards: increasing data throughput with SIMD.
Appendix: Source Code
Classical Swift matrix multiplication of 16 x 16 32-bit floating point matrices
// theReps, the total number of repetitions requested, is supplied by the
// enclosing test function; the classical loop runs it divided by 1,000.
var theCount: Float = 0.0
let theMReps = theReps/1000
let rows = 16
// A and B are the 16 x 16 input matrices and C the result, as nested arrays of Float.
let A: [[Float]] = Array(repeating: Array(repeating: 1.234, count: 16), count: 16)
let B: [[Float]] = Array(repeating: Array(repeating: 1.234, count: 16), count: 16)
var C: [[Float]] = Array(repeating: Array(repeating: 0.0, count: 16), count: 16)
for _ in 1...theMReps {
    // Classical row-by-column multiplication as a triple nested loop.
    for i in 0..<rows {
        for j in 0..<rows {
            for k in 0..<rows {
                C[i][j] += A[i][k] * B[k][j]
            }
        }
    }
    theCount += 1
}
return theCount
In the ‘classical’ CPU implementation shown above, matrices A, B and C are each 16 x 16 Floats for simplicity, and the triple nested loop is repeated theMReps times for the test.
16 x 16 32-bit floating point matrix multiplication using vDSP_mmul()
import Accelerate

// theReps, the number of repetitions, is supplied by the enclosing test function.
var theCount: Float = 0.0
// A and B are the 16 x 16 input matrices and C the result, each stored as a flat
// array of 256 Floats in row-major order, with a stride of 1 between elements.
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
// M, N and P are the matrix dimensions: A is M x P, B is P x N, and C is M x N.
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
    B.withUnsafeBufferPointer { Bptr in
        C.withUnsafeMutableBufferPointer { Cptr in
            for _ in 1...theReps {
                vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
                theCount += 1
            }
        }
    }
}
return theCount
Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision”, adding: “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”
