Code in ARM Assembly: Flow, pipelines and performance

Before moving on to look at integer and other instructions involving the general-purpose registers, this article rounds off the topic of controlling flow by considering performance, and compares two different conditional loop schemes with the code generated by Apple’s Swift compiler. However good learning assembly language may be for the soul, and however useful for those who need to grok disassembled code, the most popular reason for writing your own assembly routines is performance. However good modern compilers are, they have to make compromises when generating code; when you write your own assembly language, you can optimise it for your particular purpose.

Pipelines

One of the problems with writing code for modern processors is that they work quite differently from the simple model most of us still have in mind. Simple processors used to chug steadily through code, one instruction at a time: load that value into a register, multiply the contents of two registers putting the result into a third, and so on. Because each instruction requires a series of smaller operations, such as decoding the operation itself, calculating a memory address and fetching a value from that address, modern processors instead process several instructions at a time.

This works a bit like a production line, with four or more instructions at various stages of completion at any one time, in the processor’s pipeline. When that processor is running simple unbranching code, it’s easy for it to keep the pipeline full and the processing of instructions running at maximum throughput. But the moment that a branch instruction enters the pipeline there’s a problem: which instruction should be loaded next, one that follows the branch, or one that doesn’t?

To ensure that its pipeline doesn’t grind to a halt each time a branch instruction is loaded, a processor uses heuristics to predict which branch will be followed, and loads those instructions. If branch prediction gets it right almost all the time, then performance remains essentially unaffected by branching. Conditional loops, such as those used to implement for and while statements, can normally be predicted most reliably. Those used for if … else … cascades are harder to predict, so ARM64 assembly language offers conditional selection as a more efficient alternative, something I’ll consider in a future article in this series.

Evaluating performance

One ARM64 feature I’m particularly interested in is its small family of combined floating point multiply-and-add instructions such as
FMADD D1, D2, D3, D4
which multiplies D3 and D4, adds D2 to that result, and stores the final result in D1 (in this example). There’s an equivalent in Intel x86, and these combined instructions are known not so much for their effect on performance as for reducing error, since the intermediate result of the multiplication can be kept at higher precision. To take a first tentative look at this, I coded two iterative loops thus:

for_loop:
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
ADD X5, X5, #1
CMP X5, X4
B.LE for_loop

which tests in the tail, just like a conventional for loop, and
while_loop:
CMP X5, X4
B.GE while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
ADD X5, X5, #1
B while_loop
while_done:

which tests at the head, like a while loop.


To add a bit of excitement, I pitted my code against that generated by Apple’s Swift compiler, providing it with the code
for _ in 1...theReps {
    dZero = (tempA * theB) + theC
    tempA = ((dZero - theC)/theB) + theInc
}

which spells out what I’m calculating. I then used the Hopper disassembler to see what the compiler produced:


mov x23, x0
adrp x27, #0x100005000
adrp x26, #0x100005000
adrp x25, #0x100005000
b.eq loc_10000430c
ldr d0, [x27, #0xba0]
ldr d1, [x26, #0xba8]
fmov d2, #0x3ff0000000000000
ldr d3, [x25, #0xbb0]
loc_1000042f0:
fmul d4, d11, d0
fadd d4, d4, d1
fadd d4, d4, d3
fdiv d4, d4, d0
fadd d11, d4, d2
subs x8, x8, #0x1
b.ne loc_1000042f0

Despite my dropping it a big hint, the Swift compiler didn’t use the combined FMADD instruction, but separate FMUL and FADD. And instead of a tail test using a separate CMP, it counts down with its loop counter using a SUBS instruction, which sets the condition flags as it decrements, so the comparison with zero comes for free. Swift optimises exceedingly well.

To compare these, I’ve updated my AsmAttic app to obtain high-precision timings, and improved its output view. Version 2 of AsmAttic, complete with the source for this test, is available from here: asmattic2

Running performance tests in Xcode needs great care. For instance, my first test runs showed consistently that my assembly while loop was fastest, slightly ahead of the for loop, with Swift trailing a long way behind. But by default Xcode builds debug versions: that has no effect on my assembly code, but slows Swift down something rotten. So you need to build for distribution rather than debugging before you can make meaningful comparisons.

The call overhead, measured by running just a single loop, is negligibly small across all three tests. Over a million loops, Swift was slightly faster than my while loop, with the for loop slightly slower again. But increase the count to 10 million or more and my hand-coded while loop was fastest of all:
Time for = 0.10187779100000001 seconds
Time while = 0.063158666 seconds
Time Swift = 0.071843375 seconds

What was disappointing, though, was that my crude estimates of error showed that Swift’s code is consistently more accurate. I clearly need to look in more detail at how FMADD performs, compared with separate instructions.

LDR or ADR?

If you’ve read Stephen Smith’s article about building Hello World in assembly language for M1 Macs, you may now be wondering why my code doesn’t:

  • use ADR rather than LDR to load addresses;
  • use the .align 2 directive to ensure its instructions start on a 4-byte boundary.

The second is easy to address: his example is a standalone executable, which must be correctly aligned. My assembly routines are part of a whole project, and within that the build tools will ensure appropriate alignment of the whole.

The choice between ADR and LDR is less clear-cut. In my assembly code, I use instructions such as
LDR D5, B_DOUBLE
to load the double defined at B_DOUBLE into the floating point register D5, which appears to work fine when used within a full Xcode app. I note that the code generated by the Swift compiler is similar, for example using
ldr d3, [x25, #0xbb0]
to load a double constant. I don’t know whether this is tolerated here because of the different build tools being used. However, if you do experience problems with LDR, you now know that it might need to be converted to ADR. And that finally does lead me on to the next article.

Previous articles in this series:

1: Building an app to develop assembly routines, including an explanation of calling assembly language from Swift, with a complete Xcode project
2: Registers explained
3: Working with pointers
4: Controlling flow
5: Conditional loops

Downloads:

ARM register summary
ARM operand architecture
Conditions and conditional branching instructions
Control Flow
AsmAttic 2, a complete Xcode project (version 2)
AsmAttic, a complete Xcode project (version 1)

References

Procedure Call Standard for the Arm 64-bit Architecture (ARM), from GitHub
Writing ARM64 Code for Apple Platforms (Apple)
Stephen Smith (2020) Programming with 64-Bit ARM Assembly Language, Apress, ISBN 978 1 4842 5880 4.
Daniel Kusswurm (2020) Modern Arm Assembly Language Programming, Apress, ISBN 978 1 4842 6266 5.
ARM64 Instruction Set Reference (ARM).