Modern processors pull a lot of tricks to go faster, even though their clock speeds haven’t changed much for several years. The most obvious is the use of multiple cores: although the very first Intel Macs had just one core, for the last 15 years, all new Macs have had at least two, and the top-of-the-range Mac Pro has 28. Those are great for distributing processes, but don’t help with accelerating individual operations. This article introduces an architectural feature which enables parallel processing at the opposite end of the spectrum, SIMD.
This is an abbreviation for Single Instruction Multiple Data, a simple concept which has become far more complicated. It’s a feature of all current Intel processors (used in Macs), and is in the ARM cores built into the M1 chip. My aim here isn’t to describe differing support for SIMD in Intel processors, nor to contrast Intel with ARM, but to explain how SIMD works, how it’s already being used to accelerate the M1, and why it matters to future Macs.
To run in parallel on two or more cores, code has to be divided up into discrete processes: chunks of code which can run reasonably independently, each consisting of a whole series of instructions. The M1 chip also offers specialist processors, including its GPUs and Apple's Neural Engine, which are used by system features such as Metal graphics, but again require discrete processes.
It's common for software to process paired or similar data, including two- and three-dimensional co-ordinates, display colours such as red, green and blue, and more. These short vectors can also be processed in parallel: for many operations, a single instruction can be performed on multiple units of data, hence SIMD. If you want a mental model, think of four small cores each performing the same action on a different value in the same vector. That's essentially what SIMD does.
A common floating-point operation run in cores is multiplication. A typical instruction on an M1 core might be
FMUL D0, D1, D2
which takes the two double-precision (64-bit) floating-point numbers in registers D1 and D2, multiplies them together, and puts the result in register D0.
Some processes have to perform huge numbers of such multiplications, maybe even millions. To make these run faster, Intel and ARM cores run a ‘pipeline’ of instructions, so that at any instant, the core may be working on ten or more separate multiply instructions. However, each is at a different stage of execution, and only one gets completed at a time. It’s fast, but doesn’t run in parallel.
Like Intel processors, the M1 has another trick up its sleeve: the fused multiply-add instruction (FMA). In many cases, multiply operations are followed by addition, so the M1 offers a single instruction which does both, for example
FMADD D0, D1, D2, D3
which first multiplies D1 and D2, then adds D3, and puts the result in D0. That saves time, and rounds the result only once, so in the right circumstances it can run quicker and give more accurate results. But that improvement isn't in the same league as performing two or more multiplications at the same time.
To do that, the processor treats the same floating-point register quite differently: instead of being 64 bits wide, it becomes 128 bits, divided into lanes. To handle a vector (or group) of four single-precision 32-bit floating-point numbers, it treats those 128 bits as four 32-bit lanes. All the code has to do is load each vector register (now designated V for vector) with four numbers packed together, and run a single SIMD instruction which multiplies the corresponding lanes of two registers. For example,
FMUL V0.4S, V1.4S, V2.4S
multiplies the vector of four single-precision numbers (4S) in the vector register V1 by that in V2, and puts the result in V0. If this operation takes the same time as the single scalar FMUL, then the SIMD version runs four times as fast.
You’ll have noticed one difference, though: the scalar multiplication uses double-precision (64-bit) numbers, while the vector version uses single-precision (32-bit). When software needs high precision, it will normally keep to slower scalar operations, but for many applications, 32-bit is sufficient. It’s a very old trade-off.
The code I've quoted here is written in assembly language, which is very unusual when writing software for modern Macs. To take advantage of the SIMD features in a processor, the high-level code written by the developer has to be built so that it uses them. At present, the best way to do this is through the special SIMD calls which Apple provides in its Accelerate framework.
Intel processors have a long history of using SIMD, which goes right back to the Pentium 3 in 1999. Although modern Macs take advantage of much of this, the most recent extensions such as AVX512 are seldom used by apps. Indeed, Rosetta 2 code translation on the M1 Mac doesn't support AVX, AVX2 or AVX512 instructions, but that affects very few Mac apps. All M1 Macs, though, support the full ARM64 SIMD, also known as NEON. Developers who use Apple's SIMD calls can thus expect that their Universal apps should run faster on Intel models, and at full pelt on M1 Macs.
In case this all seems very specialist, M1 SIMD works with a wide range of data types including both integers and floating-point, and sizes from double-precision down to single bytes. It can be used to process characters in strings, co-ordinates on the display, still images, video, audio, and more.
In coming weeks, in my series on ARM64 assembly language programming, I'll go deeper into what's involved in using SIMD, and I intend to supplement that with some articles exploring Apple's SIMD and Accelerate features. To give you an idea of how well SIMD can perform, look at these figures.