Explainer: Parallel computing

There are two basic ways that computers can perform their tasks faster: the processors inside them can run at a higher clock frequency, or there can be more processors working in parallel with one another. One of the main limitations on processor frequency is that faster processors require more power, which in turn generates more heat and so demands more cooling. As a result, processor designers have increasingly turned to more cores as the answer to the need for greater speed.

It has been over twenty years since our Macs started containing more than one processor core. First, in 2000 the Power Mac G4 gained dual processors, then five years later the Power Mac G5 brought the first 2-core processors. Over the following fourteen years the number of cores grew steadily, until in 2019 Apple introduced the first Mac Pro offering as many as 28 as an option. Last week, Apple announced its first chip with 20 cores, the M1 Ultra.

Unless you’re an octopus with four displays and the concentration to interact with several apps at the same time, the only way that you’ll see worthwhile benefit from more than a few cores is when your most significant apps use parallel computing, spreading their hard work across many cores running at once.

Take a simple example of file compression and decompression. The developer can write the code so that everything is performed in a single thread, each of the steps in the compression occurring in sequence. It then doesn’t matter whether your Mac has a few cores or twenty: that code will take much the same time to compress a large file, whether you paid £/$/€ 700 for a Mac mini, or £/$/€ 4000 more for a Mac Studio with the latest M1 Ultra chip.

If the developer instead redesigns their code so that it works like a pipeline, it can make use of multiple cores. At one end, the original file could be read from disk, then streamed through each step of the compression process, until it was saved to disk at the end. Depending on the code used, this could be divided up into half a dozen steps, each of which could run on a separate core.
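As a sketch of that pipeline idea (in Python rather than a compiled macOS app, and with trivial stand-in stages instead of real compression steps), each stage below runs in its own thread, connected to its neighbours by queues, so different chunks of the file can be passing through different stages at the same time:

```python
import queue
import threading

def run_pipeline(data, stages):
    """Run each stage in its own thread, connected by queues,
    so successive chunks are processed in parallel."""
    # One queue between each pair of adjacent stages,
    # plus one feeding the first and one draining the last.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    SENTINEL = object()  # marks the end of the stream

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is SENTINEL:
                q_out.put(SENTINEL)  # pass shutdown downstream
                return
            q_out.put(stage(item))

    threads = [
        threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]))
        for i, s in enumerate(stages)
    ]
    for t in threads:
        t.start()

    # Feed chunks in at one end of the pipeline.
    for chunk in data:
        queues[0].put(chunk)
    queues[0].put(SENTINEL)

    # Collect finished chunks at the other end.
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

# Hypothetical two-stage 'compression': the stages here are
# placeholders, not a real compressor.
out = run_pipeline([b"hello", b"world"],
                   [lambda b: b.upper(), lambda b: b[::-1]])
```

Because each queue is first-in, first-out and each stage has a single worker, chunks leave the pipeline in the order they entered, while up to one chunk per stage is being worked on concurrently.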

Some types of app, such as those that visualise 3D images, perform many tasks that can run in parallel on very large numbers of cores. While many of those are now catered for in GPUs, and some use the M1 Neural Engine, there are still plenty that need regular CPU cores. To make this easier in macOS, Apple provides Grand Central Dispatch, a simple way for apps to create and manage multiple threads.
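GCD itself is used from Swift or Objective-C, but the underlying pattern, handing many independent pieces of work to a pool of worker threads, is easy to illustrate in Python with its standard thread pool. The shade_pixel function here is purely an invented stand-in for real per-item work:

```python
from concurrent.futures import ThreadPoolExecutor

def shade_pixel(i):
    # Stand-in for genuinely independent per-item work,
    # such as shading one pixel of a rendered image.
    return i * i

# The pool spreads the iterations across worker threads, much as
# a concurrent dispatch queue spreads blocks across cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(shade_pixel, range(8)))
```

Note that pool.map returns results in input order, regardless of which thread finished first, which is also what you want when reassembling an image from independently computed pieces.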

Parallel computing is inevitably more complex than simple serial computation. Sometimes different threads need to co-ordinate what they’re doing, to ensure that processing steps occur in the right sequence. As you can imagine, such co-ordination can lead to problems of its own, where threads can become deadlocked or livelocked if you’re not careful.
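A deadlock classically arises when two threads each hold one lock while waiting for the other’s, so neither can ever proceed. The standard defence is to make every thread acquire locks in the same fixed order, as in this sketch (the two-account transfer is invented for illustration):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
balance = {"a": 100, "b": 100}

def transfer(amount):
    # Every thread takes the locks in the same fixed order: a, then b.
    # If another code path took lock_b first, two threads could each
    # hold one lock and wait forever for the other's: a deadlock.
    with lock_a:
        with lock_b:
            balance["a"] -= amount
            balance["b"] += amount

threads = [threading.Thread(target=transfer, args=(1,))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a consistent lock order, the ten transfers serialise safely; reverse the order in just one thread and the program can hang without any error message, which is why such bugs are so hard to find.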

Although some tasks can benefit from massive parallelism, for many there are optimum numbers of threads, and thus of processor cores. This is recognised in Amdahl’s Law, which (slightly restated) says that acceleration of a task by adding more cores will reach a maximum, beyond which more cores won’t bring any further benefit. That’s only true when the work to be done remains the same regardless of the hardware; in practice, more powerful hardware is normally used to tackle bigger tasks, which then invokes the more optimistic Gustafson’s Law instead.
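Both laws are simple enough to state as formulas. If p is the fraction of a task that can be parallelised and n the number of cores, Amdahl’s speedup is 1/((1 - p) + p/n), which can never exceed 1/(1 - p); Gustafson’s scaled speedup, (1 - p) + p*n, keeps growing because the task is assumed to grow with the hardware. A small sketch:

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when only a fraction p of the work
    parallelises; the rest stays serial. Tends to 1/(1 - p)."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Scaled speedup when the task grows to fill n cores."""
    return (1.0 - p) + p * n

# With 90% of the work parallel, ten cores give roughly a 5.3x
# speedup under Amdahl, and even a million cores stay under 10x;
# Gustafson, letting the task grow, credits about 9.1x for ten.
```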

Behind Amdahl’s Law is another more general phenomenon affecting the performance of all systems: the bottleneck, or rate-limiting step. Consider an app which generates an image using methods amenable to parallel computing, then has to write that image to disk in a process which normally takes the same time as generating the image in the first place. As you run that app on increasing numbers of cores, it will get progressively quicker until generating the image takes almost no time at all, but you then spend all your time waiting for the image to be written to disk, something that can’t be accelerated by adding more cores, as it’s the rate-limiting step.
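That rate-limiting behaviour is easy to put in numbers. Assuming, purely for illustration, that generating the image takes 10 seconds on one core and parallelises perfectly, while writing it out always takes 10 seconds:

```python
def total_time(cores, generate=10.0, write=10.0):
    # Generation splits evenly across the cores; the disk write is
    # serial, so it sets a floor that extra cores can't remove.
    return generate / cores + write

# 1 core: 20s; 10 cores: 11s; 100 cores: 10.1s. Beyond a handful
# of cores the write dominates, and extra cores barely help.
```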

Even when you’re the developer of an app, tuning it for best performance on multi-core systems is often little better than educated guesswork. Predicting whether it will benefit from more cores can also be tricky, as even though it may be able to run its threads on all ten or twenty cores, the rate-limiting step may be down to a single thread which determines most of the time taken to perform a specific task.

Parallel computing is wonderful. It allows meteorological modelling of the whole globe to deliver detailed forecasts, often for days in advance. It supports real-time manipulation of elaborate textured 3D models in complex surroundings. But there are plenty of other, more mundane tasks that don’t necessarily get any quicker.