When idiomatic code is slower, and how to Accelerate

The arrival of Apple’s M1 Macs has focussed our attention on performance. Although there have always been a few who rush out and run the Geekbench benchmarks on their latest go-faster Intel Mac, the rules seemed fixed: the more recent your Mac and the more you paid for it, the faster it was. With base M1 models now a fraction of the cost of many Intel Macs, Apple has turned this on its head, for the moment at least.

There’s a great deal more to attaining better performance than merely using a faster processor, though. In this article I look at one aspect that we all too rarely seem to think about: how well code is optimised for speed.

Almost all code, at least that in apps, is now written in high-level languages like Objective-C and Swift. With modern optimising compilers, it may seem that they’ll guarantee best performance in all but the most marginal of cases. But in Swift, in particular, there are often many different ways that a developer can code even everyday features. To look at this, I’ve taken the example of the vector dot-product, a fairly common calculation in many apps, which is particularly amenable to acceleration. What this does is take two vectors of equal length, multiply corresponding elements within them, and total those products to yield a single number – something you could readily do on a calculator.
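
For example, the dot-product of the two four-element vectors (1, 2, 3, 4) and (5, 6, 7, 8) is (1 × 5) + (2 × 6) + (3 × 7) + (4 × 8) = 70.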

It’s also a task which is suitable for parallel processing. If you had a processor which was capable of performing the multiplications in parallel, then this task could proceed much faster. Unsurprisingly, both Intel and ARM processors support what are known as Single Instruction, Multiple Data (SIMD) extensions to do this. Not only that, but Apple provides libraries of functions to support SIMD, including dot-products for a range of different types and lengths of vectors.
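
To make this more concrete, here’s a minimal sketch in Swift – with vector values which are purely my own illustrations – of how one of Apple’s SIMD types carries all four multiplications in a single operation:
import simd
let a = simd_float4(1, 2, 3, 4)
let b = simd_float4(5, 6, 7, 8)
let products = a * b          // (5.0, 12.0, 21.0, 32.0): four multiplications at once
let total = products.sum()    // 70.0, the dot-product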

I have therefore coded a simple task which performs many of these dot-product calculations on pairs of vectors containing four 32-bit floating point values. I’ve coded this in three different ways using Swift, one of which calls Apple’s SIMD library to perform as much of the calculation as possible, and I’ve also had a go at writing it in NEON (SIMD) assembly code. To give you an idea of what’s involved, here are the heavy-lifting sections from each version.

Code

I always feel guilty about writing pedestrian Swift, which doesn’t use the idioms the experts are so fond of. To see how a Swifty’s code might run, I used:
tempA = zip(vA, vB).map(*).reduce(0, +)
to perform the dot-product itself, and refer to this as “idiomatic”. This may appear opaque (as I’m sure it’s intended!), but what it does is ‘zip’ the two vectors together, ‘map’ in the multiplication, and ‘reduce’ the products by totalling them. Very succinct.
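
As a self-contained sketch, with illustrative values for vA and vB, the whole line works like this:
let vA: [Float] = [1, 2, 3, 4]
let vB: [Float] = [5, 6, 7, 8]
// zip pairs corresponding elements, map(*) multiplies each pair,
// and reduce(0, +) totals those four products
let tempA = zip(vA, vB).map(*).reduce(0, +)   // 70.0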

My sort of Swift is typical of a novice, who hasn’t yet discovered the power of the language:
tempA = 0.0
for i in 0...3 {
    tempA += vA[i] * vB[i]
}

which should be self-explanatory. I’ll refer to this as the “plain” code, and apologise that it’s so ugly and unSwiftian.

Calling the SIMD library from Swift is the simplest of the lot:
tempA = simd_dot(vA, vB)
where vA and vB are specially created simd_float4 vectors of Floats.
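
In full, a minimal sketch of that call, again with illustrative values, looks like:
import simd
let vA = simd_float4(1, 2, 3, 4)
let vB = simd_float4(5, 6, 7, 8)
let tempA = simd_dot(vA, vB)   // 70.0, the whole dot-product in a single call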

Finally, the heavy lifting in my novice NEON assembly language routine comes down to three instructions:
FMUL V1.4S, V2.4S, V3.4S    // multiply the four pairs of Floats in parallel
FADDP V0.4S, V1.4S, V1.4S   // add adjacent pairs of those products
FADDP V0.4S, V0.4S, V0.4S   // add the pairs again, leaving the total in lane 0

The first instruction performs the multiplication across corresponding elements in the two vectors, using operands which refer to the SIMD V registers, each working as four Floats. The two FADDP instructions then total those products by successive pairwise additions, leaving the sum in the S0 register (the lowest lane of V0) to return to the calling code.
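
Each of these fragments sits inside a loop which repeats the calculation a million times. As a rough sketch of such a timing harness – a simplification, using Foundation’s Date rather than the exact timing method behind the figures below – the SIMD version might look like this:
import Foundation
import simd
let vA = simd_float4(1, 2, 3, 4)
let vB = simd_float4(5, 6, 7, 8)
var total: Float = 0.0
let start = Date()
for _ in 0..<1_000_000 {
    total += simd_dot(vA, vB)   // accumulate, so the work is not simply optimised away
}
let elapsed = Date().timeIntervalSince(start)
print("total \(total), in \(elapsed) s")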

Results

Interestingly, whichever code I ran, on either my Intel-based iMac Pro or M1 Mac mini, the results of these calculations were identical. The only differences were the time they took to achieve those results.

Consistently the fastest was using Apple’s SIMD library on the M1, which I’ll standardise as 100% for the purposes of comparison. Around 110% (‘ten percent slower’ if you like), but still very close, were the SIMD library on the Intel processor, and my own NEON assembly language on the M1. My plain Swift running on Intel then came in at around 130%. I think you’d be hard pressed to notice much difference between these, which computed a million dot-products in around a thousandth of a second – that’s about four billion multiplications (together with additions and more) every second.

There was then something of a gap before the slower horses came in. Next was plain Swift code running on the M1 Mac, at 485%, which is a little surprising. Bringing up the rear were the two versions of ‘idiomatic’ Swift, on the M1 at 15,100%, and on Intel at 26,200%. Perhaps at last my embarrassingly unidiomatic Swift has proved beneficial.

Conclusions

If you can find a suitable function in Apple’s SIMD or Accelerate libraries, use it. Unfortunately their documentation is less than minimal, but they achieve excellent performance on both Intel and M1 Macs.
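
For instance, Accelerate’s vDSP library provides a dot-product function, vDSP_dotpr, which works on ordinary arrays of Floats of any length. A minimal sketch of calling it, again with illustrative values:
import Accelerate
let vA: [Float] = [1, 2, 3, 4]
let vB: [Float] = [5, 6, 7, 8]
var result: Float = 0.0
// strides of 1 step through every element; the final argument is the element count
vDSP_dotpr(vA, 1, vB, 1, &result, vDSP_Length(vA.count))
// result is now 70.0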

By all means use ‘smart’ Swift idioms for code whose performance isn’t critical. But when you want your code to run fast, skip the style and write clean code which will optimise well.

With minimal coding effort, using SIMD on current M1 Macs outperforms far more expensive Intel processors. Watch out for the successor to the M1!

Appendix: typical performance times

Each time is for a million repetitions of the same 32-bit Float dot-product on vectors of four elements, given in seconds. The percentages quoted in the Results above are these times relative to the fastest, 0.000946 s, taken as 100%.

  • M1 SIMD library 0.000946 s
  • Intel SIMD library 0.00103 s
  • M1 SIMD assembly 0.00106 s
  • Intel plain Swift 0.00125 s
  • M1 plain Swift 0.00459 s
  • M1 idiomatic Swift 0.143 s
  • Intel idiomatic Swift 0.248 s