Explainer: Benchmarks

There’s nothing as controversial in computing as benchmarking. The basic concept is simple: select a test, time how long it takes to perform, and compare that result with those on other systems. Then, when the comparison doesn’t go the way you wanted, pick a different test, and so on, until you satisfy your confirmation bias.

Taking my tongue slightly out of my cheek, we’re only too well aware of how misleading benchmarks can be, and how popular arguing about them has become. Let me explain how they can work.

Everything we do on and with our Macs is limited somewhere by performance. In each system and every action there’s a bottleneck or rate-limiting step which determines how long different tasks take. Good engineering aims to remove each of those bottlenecks, but there’s always the next one.

There are two basic ways to tackle the assessment of performance: test each individual step, such as GPU and disk performance, one at a time, or test them all together by performing the whole task. These form the two main types of benchmark: synthetic, and application.

Synthetic benchmarks are specially constructed to test individual aspects of overall performance. Historically important examples include Whetstone and Dhrystone, which for many years were the standard measures of CPU floating-point and integer performance respectively. As processors have advanced, synthetic benchmarks have become more complex. A simple example of pure synthetic benchmarks is my own app AsmAttic, which uses tight loops of (mostly) assembly code that don’t even access memory, and run entirely on the core and its registers. For measuring transfer speeds of storage, I offer Stibium, which you can use on everything from hard disks to the internal SSDs in the Mac Studio.
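To make that concrete, here’s a minimal Swift sketch of the idea behind such a test: a tight, register-bound integer loop timed by wall clock. It’s purely illustrative and isn’t AsmAttic’s code; a serious synthetic benchmark uses hand-written assembly, as AsmAttic does, so that the compiler can’t change what’s being measured.

```swift
import Foundation

// A minimal synthetic benchmark: a tight integer loop that should stay
// in registers, so it measures the core rather than memory or storage.
// The iteration count and the arithmetic are arbitrary choices.
func integerLoop(iterations: Int) -> UInt64 {
    var x: UInt64 = 1
    for i in 1...UInt64(iterations) {
        // A dependent multiply-add chain stops the compiler from
        // removing the loop entirely.
        x = x &* 2862933555777941757 &+ i
    }
    return x
}

let start = DispatchTime.now()
let result = integerLoop(iterations: 100_000_000)
let end = DispatchTime.now()
let seconds = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
print("Result \(result) in \(seconds) s")
```

Built with optimisation and a large iteration count, the elapsed time reflects the core’s integer throughput, with memory and storage playing no part.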

While synthetic benchmarks are ideal for discovering performance bottlenecks, and comparing variants of the same core architecture, they tell you precious little about the overall performance you could expect when running apps. For that you should turn to application benchmarks.

The most meaningful application benchmarks for any user are carefully timed tests of real-world tasks that take appreciable time, run in the production version of the app they normally use. These are highly individual. For an app developer using Xcode, an obvious example is the time taken to build a particular project; 3D rendering and video processing are also amenable to equivalent tests. If you care about performance, it’s worth designing your own and measuring them carefully on each Mac you have access to.
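If it helps, here’s a minimal Swift sketch of that approach, timing a complete task by wall clock. The example command is an Xcode build, and the project and scheme names are placeholders rather than anything real; in practice you’d repeat each run several times and compare medians, not single figures.

```swift
import Foundation

// Time a complete real-world task by wall clock. The task here is an
// Xcode build, but the same wrapper can time any command-line task.
func timeTask(_ tool: String, _ arguments: [String]) throws -> TimeInterval {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: tool)
    process.arguments = arguments
    let start = Date()
    try process.run()
    process.waitUntilExit()
    return Date().timeIntervalSince(start)
}

// Hypothetical project and scheme names, for illustration only.
let seconds = try timeTask("/usr/bin/xcodebuild",
                           ["-scheme", "MyApp", "clean", "build"])
print(String(format: "Build took %.1f s", seconds))
```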

Between synthetic and application benchmarks lies an uncomfortable compromise: tests which aim to assess performance on standardised app tasks. Among the most popular, on Macs at least, is Primate Labs’ Geekbench 5. Such benchmarks assemble a series of representative tasks in a performance category, such as CPU and ‘Compute’, to generate a score that can be compared across different platforms and architectures, in this case including macOS, iOS, Windows, Android and Linux.

These intermediate benchmarks don’t provide the same depth and detail as synthetic benchmarks can, and don’t tell you whether the tasks you perform in your main working apps will be faster or slower on another system. However, they’re the only way of drawing comparisons between completely different platforms. How you interpret observed differences is another problem.

Perhaps the greatest shortcoming of all benchmarks is how poorly they translate into the perception of performance by the user. I’ve written elsewhere about the psychology of computer performance, and how even small delays in the human interface are amplified. If opening a Finder window normally takes one second, you’ll notice its reduction to 0.8 s much more than you would a Time Machine backup taking eight minutes rather than ten.

To give a topical example, results from my AsmAttic synthetic benchmarks are essentially identical for each of the Performance cores across the whole range of M1 chips currently used in Apple Silicon Macs. That’s unsurprising, as those P cores are the same design. All those chips also give similar Geekbench 5 single-core scores of around 1770.

Multi-core scores are more complex. When synthetic tests are confined to the P cores, the M1 Pro and Max chips are almost exactly twice as fast as the original M1, as they contain twice the number of P cores. Geekbench, though, also runs on the Efficiency cores, which deliver lower performance than the P cores. As the M1 Pro and Max have only two E cores, their contribution to the total score is smaller than that of the four E cores in the original M1. As a result, the M1 Pro and Max deliver less than twice the multi-core score of the original M1.
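A toy calculation shows why. The assumption that an E core contributes roughly a quarter of a P core’s score is mine, purely for illustration, but the shape of the result doesn’t depend on the exact ratio.

```swift
// Illustrative arithmetic only: assume each P core contributes a score
// of p, and each E core roughly a quarter of that (an assumed ratio,
// not a measured one).
let p = 1.0
let e = 0.25 * p

let m1 = 4 * p + 4 * e        // original M1: 4 P + 4 E cores = 5.0
let m1ProMax = 8 * p + 2 * e  // M1 Pro/Max: 8 P + 2 E cores = 8.5

print(m1ProMax / m1)          // 1.7, less than twice the original M1
```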

However, according to Geekbench 5, all the M1 chips are significantly faster than the eight cores in an Intel Xeon W processor in my base iMac Pro. The latter has a single-core score of around 1100, and multi-core of nearly 8000; for the M1 Pro, those are around 1800 and 12500 respectively, which you’d expect to translate into faster apps. But individual application benchmarks vary considerably, with some users reporting little difference between comparable Intel and M1 Macs.

Comparing GPUs is even trickier, because of the variation between tests. Mac GPUs, particularly those in Apple Silicon models, rely on Metal for best performance, but some graphics benchmarks don’t even run on the GPU, and others use cross-platform libraries such as OpenCL, which has been deprecated by Apple and is therefore unlikely to achieve good performance on the GPUs in M1 models. Some compilations of benchmark results, such as those for Blender, place M1 GPUs far below NVIDIA graphics cards, presumably because their tests make poor use of Metal.

In this context, the benchmark testing provided in Affinity Photo might form a better comparison. For combined vector and raster graphics running on the GPU using Metal, it gives a score of 5794 for the AMD Radeon Pro Vega 56 in my iMac Pro, and 21880 for the 24-core GPU in the M1 Max chip in my Mac Studio. Maybe I just find those figures more credible because they confirm what I expect, having spent so much on the new Mac Studio. Which takes me back to where I started, with benchmark testing just being a sophisticated form of confirmation bias.