Accelerating the M1 Mac: an introduction to SIMD

Modern processors pull a lot of tricks to go faster, even though their clock speeds haven’t changed much for several years. The most obvious is the use of multiple cores: although the very first Intel Macs had just one core, for the last 15 years, all new Macs have had at least two, and the top-of-the-range Mac Pro has 28. Those are great for distributing processes, but don’t help with accelerating individual operations. This article introduces an architectural feature which enables parallel processing at the opposite end of the spectrum, SIMD.

This is an abbreviation for Single Instruction Multiple Data, a simple concept which has become far more complicated. It’s a feature of all current Intel processors (used in Macs), and is in the ARM cores built into the M1 chip. My aim here isn’t to describe differing support for SIMD in Intel processors, nor to contrast Intel with ARM, but to explain how SIMD works, how it’s already being used to accelerate the M1, and why it matters to future Macs.

To run in parallel on two or more cores, code has to be divided into discrete processes: chunks of code which can run reasonably independently, each consisting of a whole series of instructions. The M1 chip also offers specialist processors, including its GPU and Apple’s Neural Engine, which are used by system features such as Metal graphics, but those too require discrete processes.

It’s common for software to process paired or similar data, which includes two- and three-dimensional co-ordinates, display colours such as red, green and blue, and more. Processing these short vectors can be tackled in parallel too: for many operations, a single instruction can be performed on multiple units of data, hence SIMD. If you want a mental model, think of four small cores each performing the same action on a different value in the same vector. That’s essentially what SIMD does.
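
If you’d rather see that mental model in code than in registers, here’s a minimal sketch in Swift using Apple’s simd library, with made-up colour values: a single multiply acts on all four components at once.
import simd
// scale the red, green, blue and alpha components of a colour in one operation
let colour = simd_float4(0.2, 0.4, 0.6, 1.0)
let brightened = colour * 1.5
// brightened is approximately (0.3, 0.6, 0.9, 1.5)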

A common floating-point operation run in cores is multiplication. A typical instruction on an M1 core might be
FMUL D0, D1, D2
which takes the two double-precision (64-bit) floating-point numbers in registers D1 and D2, multiplies them together, and puts the result in register D0.

Some processes have to perform huge numbers of such multiplications, maybe even millions. To make these run faster, Intel and ARM cores run a ‘pipeline’ of instructions, so that at any instant, the core may be working on ten or more separate multiply instructions. However, each is at a different stage of execution, and only one gets completed at a time. It’s fast, but doesn’t run in parallel.

Like Intel processors, the M1 has another trick up its sleeve: the fused multiply-add instruction (FMA). In many cases, multiply operations are followed by addition, so the M1 offers a single instruction which does both, for example
FMADD D0, D1, D2, D3
which first multiplies D1 and D2, then adds D3 and puts the result in D0. That saves time, and rounds the result only once, so in the right circumstances it can run quicker and give more accurate results. But that improvement isn’t in the same league as performing two or more multiplication operations at the same time.
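
In Swift, the same pattern is available through the standard library’s addingProduct, which performs the multiply and add with a single rounding; this minimal sketch uses made-up values, and on ARM64 the compiler can turn it into an FMADD.
let d1 = 3.0, d2 = 4.0, d3 = 5.0
let result = d3.addingProduct(d1, d2)
// result is 5.0 + (3.0 * 4.0) = 17.0, rounded only once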

To do that, the processor treats the same floating-point register quite differently: instead of being 64 bits wide, it becomes 128 bits, and is divided into lanes. To handle a vector (or group) of four single-precision 32-bit floating-point numbers, it treats those 128 bits as four 32-bit lanes. All the code has to do is load two of these registers (now designated V for vector) with four numbers packed into each, then run a single SIMD instruction to multiply them lane by lane. For example,
FMUL V0.4S, V1.4S, V2.4S
multiplies the vector of four single-precision numbers (4S) in the vector register V1 by that in V2, and puts the result in V0. If this operation takes the same time as the single multiply FMUL, then the SIMD version will run four times as fast.
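
Here’s a rough high-level equivalent of that instruction in Swift’s simd library, with hypothetical values; the names v0, v1 and v2 simply mirror the registers above, and the compiler is free to emit a single four-lane FMUL for the multiplication.
import simd
let v1 = simd_float4(1.0, 2.0, 3.0, 4.0)
let v2 = simd_float4(5.0, 6.0, 7.0, 8.0)
let v0 = v1 * v2
// v0 is (5.0, 12.0, 21.0, 32.0): four products from one operation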

You’ll have noticed one difference, though: the scalar multiplication uses double-precision (64-bit) numbers, while the vector version uses single-precision (32-bit). When software needs high precision, it will normally keep to slower scalar operations, but for many applications, 32-bit is sufficient. It’s a very old trade-off.

The code I’ve quoted here is written in assembly language, which is very unusual when writing software for modern Macs. To take advantage of the SIMD features in a processor, the high-level code written by the developer has to be built so that it uses those instructions. At present, the best way to do this is to use the special SIMD calls which Apple provides with its Accelerate features.
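
As a sketch of what that looks like in practice, this uses the vDSP interface from the Accelerate framework (assuming a recent version of macOS, and made-up data) to multiply two large arrays element by element, leaving Accelerate to drive the SIMD units.
import Accelerate
// multiply a million pairs of single-precision values element by element
let a = [Float](repeating: 1.5, count: 1_000_000)
let b = [Float](repeating: 2.0, count: 1_000_000)
let c = vDSP.multiply(a, b)
// each element of c is 3.0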

Intel processors have a long history of using SIMD, which goes right back to the Pentium III in 1999. Although modern Macs take advantage of much of this, the most recent extensions like AVX512 are seldom used by apps. Indeed, Rosetta 2 code translation on the M1 Mac doesn’t support AVX, AVX2 or AVX512 instructions, but that affects very few Mac apps. All M1 Macs, though, support the full ARM64 SIMD, also known as NEON. Developers who use Apple’s SIMD calls can thus expect their Universal apps to run faster on Intel models, and at full pelt on M1 Macs.

In case this all seems very specialist, M1 SIMD works with a wide range of data types including both integers and floating-point, and sizes from double-precision down to single bytes. It can be used to process characters in strings, co-ordinates on the display, still images, video, audio, and more.
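
For example, the Swift standard library’s integer SIMD types can adjust sixteen bytes at a time; this sketch, again with made-up values, adds a constant to each byte using wrapping addition.
// add 16 to each of sixteen bytes in a single operation
let bytes = SIMD16<UInt8>(repeating: 100)
let adjusted = bytes &+ SIMD16<UInt8>(repeating: 16)
// every lane of adjusted is 116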

In coming weeks, in my series on ARM64 assembly language programming, I’m going to go deeper into what’s involved in using SIMD, and I intend to supplement that with some articles exploring Apple’s SIMD and Accelerate features. To give you an idea of how well SIMD can perform, look at these figures.