Explainer: Vectors, Accelerate and poor performance on M1 Macs

Our Macs process a lot of numbers, some whole numbers (integers), but a great many are now floating-point. Old computer systems, for example, used to refer to locations on the display using a pair of integers like (234, 567), which might refer to the pixel 234 units across the display and 567 down from the top (or up from the bottom, according to convention). These days, display coordinates are a pair of floating-point numbers like (0.234, 0.567) which can then be scaled automatically to a point on any physical display, whatever its resolution.

These pairs of numbers also form vectors – one-dimensional arrays of numbers. Short vectors are commonplace: when working in 3D graphics, coordinates come in a triple (x, y, z), and can be extended to include alpha or transparency bringing their length up to 4 as in (x, y, z, ɑ). Rather than performing calculations on each number individually in that vector of length 4, it’s usually far quicker to work with all four at once, using what’s termed vector processing.

With a little careful design, these short vectors can use special units in a core which perform the same instruction on each number in the vector in parallel, what’s known as Single Instruction, Multiple Data (SIMD) processing. Let’s consider a vector representing the three coordinates of a point in 3D, which needs to be transformed by multiplying each number by a factor and adding another value, to give the result. In ordinary code, that could be performed by iterating through the three numbers in turn, requiring a total of three multiplications and three additions for every point.
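To make that concrete, here’s a minimal sketch in C of the ordinary, one-number-at-a-time approach. The function name transform_scalar and its parameters are made up for illustration, and aren’t taken from any library.

```c
#include <stddef.h>

/* Scale and shift each number in turn: out[i] = in[i] * scale + shift.
   transform_scalar is a hypothetical name used only for this example. */
void transform_scalar(const float *in, float *out, size_t n,
                      float scale, float shift)
{
    for (size_t i = 0; i < n; i++) {
        /* one multiplication and one addition per number */
        out[i] = in[i] * scale + shift;
    }
}
```

For a single 3D point, n is 3, so the loop body runs three times, giving the three multiplications and three additions described above.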

What if you designed a vector processing unit which took all three numbers and did the same on each at the same time? Instead of iterating three times to scale and shift each coordinate, the core would then be able to perform those operations in parallel. That’s just what Intel, ARM and others have done.

Intel’s Advanced Vector Extensions, AVX, were proposed in 2008 and first shipped in Sandy Bridge processors in 2011. They do exactly that, and owe their origins to the Streaming SIMD Extensions (SSE) in the Pentium III back in 1999. They have since been enhanced in AVX2, from Haswell onwards. Haswell also brought fused multiply-add instructions, which in this case would allow the multiplication and addition to be performed in a single step, bringing further speed improvements and, because the result is rounded only once, greater accuracy.

ARM processors have an equally sophisticated counterpart, termed NEON. The cores in M1 series chips each have their own units for performing floating-point and NEON SIMD operations. For example, code might load one 128-bit register with four 32-bit floating-point numbers, or integers, and then execute a fused multiply-add instruction on them. What would have required four separate multiplications and four additions now takes just a single instruction, and that’s where the substantial acceleration of vector processing comes from.
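As a rough sketch of what that might look like in C, the function below uses NEON intrinsics from arm_neon.h when built for 64-bit ARM, and falls back to a plain loop elsewhere so that it still compiles on other processors. The name transform4 is invented for this example, and real code would normally process long arrays in blocks of four rather than a single vector.

```c
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Transform four 32-bit floats: out[i] = in[i] * scale + shift.
   transform4 is a hypothetical name used only for this example. */
void transform4(const float in[4], float out[4], float scale, float shift)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t v  = vld1q_f32(in);         /* load four floats into one 128-bit register */
    float32x4_t sc = vdupq_n_f32(scale);    /* copy the factor into all four lanes */
    float32x4_t sh = vdupq_n_f32(shift);    /* copy the offset into all four lanes */
    float32x4_t r  = vfmaq_f32(sh, v, sc);  /* fused multiply-add: r = sh + v * sc */
    vst1q_f32(out, r);                      /* store the four results */
#else
    /* Portable fallback: the same arithmetic, one number at a time. */
    for (int i = 0; i < 4; i++) {
        out[i] = in[i] * scale + shift;
    }
#endif
}
```

On an M1, the multiply and add for all four numbers are carried out by the single vfmaq_f32 operation; the loads, lane copies and store around it are the cost of getting data into and out of the vector registers.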

Normally, though, code written in a high-level language like Objective-C or Swift doesn’t make explicit use of Intel or ARM vector processing, beyond whatever the compiler manages to vectorise automatically. There are two good ways that it can: by calling special optimised code, perhaps written in assembly language, or by using Apple’s Accelerate maths functions, which in turn call code which is optimised for the processor being used.
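Here’s a hedged sketch of the second route, using the same scale-and-shift calculation. It calls vDSP_vsmsa from Accelerate’s vDSP library, which multiplies a vector by a scalar and adds another scalar. The USE_ACCELERATE macro and the name transform_many are assumptions made for this example: without the macro, the code falls back to a plain loop so it can be built on systems that lack Accelerate.

```c
#include <stddef.h>

#if defined(USE_ACCELERATE)   /* hypothetical opt-in macro for this example */
#include <Accelerate/Accelerate.h>
#endif

/* out[i] = in[i] * scale + shift for n floats.
   transform_many is a hypothetical name used only for this example. */
void transform_many(const float *in, float *out, size_t n,
                    float scale, float shift)
{
#if defined(USE_ACCELERATE)
    /* Vector-scalar multiply and scalar add, with a stride of 1
       through both the input and output arrays. */
    vDSP_vsmsa(in, 1, &scale, &shift, out, 1, (vDSP_Length)n);
#else
    /* Fallback for systems without Accelerate. */
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * scale + shift;
    }
#endif
}
```

On a Mac, this might be built with something like -DUSE_ACCELERATE -framework Accelerate. The calling code is the same on Intel and M1 Macs; the choice of instructions underneath is left to Accelerate.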

Also included in the M1 series chips is an undocumented set of matrix maths extensions, known unofficially as AMX2. Those cater for two-dimensional arrays of numbers, matrices, which can readily become too big to handle efficiently using NEON, and for longer vectors. Other processing units within the M1, including the Neural Engine and GPU, can also offer powerful support for specific types of mathematical calculations involving vectors and matrices. Apple provides no documented way to program the matrix extensions directly, so the supported route to them is through the Accelerate library, while the GPU is reached through Metal calls and the Neural Engine through Core ML.
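For matrices, the equivalent sketch is a matrix multiplication handed to cblas_sgemm, the standard BLAS routine that Accelerate provides. As before, the USE_ACCELERATE macro and the name matmul are assumptions for illustration, with a plain triple loop as the fallback. Whether any given call ends up running on the matrix extensions is Accelerate’s decision, and isn’t something this code can control or observe.

```c
#if defined(USE_ACCELERATE)   /* hypothetical opt-in macro for this example */
#include <Accelerate/Accelerate.h>
#endif

/* C = A * B, where A is m x k, B is k x n and C is m x n,
   all stored row by row. matmul is a hypothetical name. */
void matmul(const float *a, const float *b, float *c, int m, int n, int k)
{
#if defined(USE_ACCELERATE)
    /* C = 1.0 * A * B + 0.0 * C; the leading dimensions are the row lengths. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, a, k, b, n, 0.0f, c, n);
#else
    /* Fallback: the textbook triple loop. */
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
#endif
}
```

The fallback performs m × n × k multiplications and as many additions, which is why large matrices benefit so much from hardware designed to work on whole blocks at once.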

Some third-party code implements its own vector and matrix processing. Because that’s limited compared with what’s available in the Accelerate libraries, it can result in relatively poor performance when ported to ARM code. There are also problems in running some accelerated Intel code using Rosetta 2: although it translates the older SSE instructions, at the time of writing it doesn’t support AVX, AVX2 or AVX-512, so code which relies on those has to fall back to a slower path, or can’t be run at all.

An app which makes good use of Accelerate should therefore show significant improvement in performance when run on an M1 series Mac. Others which rely on their own solutions may show no acceleration at all, and in some cases could actually run slower. Pick your benchmarks carefully.