Are there flaws in some ARM64 instructions?

Floating point maths is a careful compromise between speed and accuracy. One widely used design feature in many processors is the use of fused instructions to perform both multiply and add in one fell swoop, that is to calculate
d = (a * b) + c
in one instruction, known as a fused multiply-add, rather than requiring a multiply instruction followed by a separate add. This has two potential benefits:

  • The intermediate result doesn’t need to be rounded, so the fused instruction gives scope for just a single rounding error rather than two.
  • The instruction can be optimised to reduce processor cycles and improve performance.

In practice, on most general-purpose processors, the greater benefit realised is the reduction in rounding error.
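Here’s a minimal Swift illustration of that single rounding, using contrived values of my own choosing. The standard library’s addingProduct(_:_:) computes a fused multiply-add with one rounding, and on ARM64 should compile down to FMADD:

let a = 100_000_001.0              // 1e8 + 1, exactly representable as a Double
let b = a
let c = -(1.0e16 + 2.0e8)          // the exact product a * b is 1e16 + 2e8 + 1
let separate = (a * b) + c         // the product rounds to 1e16 + 2e8 first, giving 0.0
let fused = c.addingProduct(a, b)  // one rounding of the exact result, giving 1.0

With separate instructions the product loses its final bit to rounding before the add; the fused form keeps it, so the two results here differ by a full 1.0.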

In conjunction with my series here on assembly language programming for the ARM64, I’ve been looking at that processor’s fused multiply-add instruction FMADD, and have some puzzling results to report: so far, it appears that using the FMADD instruction rather than FMUL followed by FADD increases cumulative error, but is slightly faster. State-of-the-art compilers also seem to avoid using FMADD, and opt for separate instructions, suggesting that this may be a known shortcoming in the ARM64 implementation.

To assess this, I’ve been looking at very large numbers of iterative loops involving multiply-add operations. Expressed in Swift, these run through the loop
for _ in 1...theReps {
    dZero = (tempA * theB) + theC      // d = (a * b) + c
    let tempB = ((dZero - theC)/theB)  // reverse: a = (d - c)/b
    tempA = tempB + theInc             // increment a for the next pass
}

This first calculates
d = (a * b) + c
then reverses that calculation using
a = (d - c)/b
which should of course equal the original value of a when the arithmetic is perfectly precise. In the loop, a is then incremented by 1.0 ready for the next iteration, so at the end the value of a should equal its starting value (set by the user) plus the number of iterations. In reality, it accumulates rounding and any other errors incurred in all the floating point arithmetic.
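To make the method concrete, here’s a minimal sketch of such a harness in Swift. This isn’t the code used for the measurements here, and theB, theC and the starting value of a are placeholders; addingProduct(_:_:) is used to request fused semantics, as compiled Swift doesn’t contract the plain expression into FMADD by itself:

func cumulativeError(reps: Int, fused: Bool) -> Double {
    let theB = 3.14159       // placeholder values; the originals are user-set
    let theC = 1.23456
    let theInc = 1.0
    var tempA = 1.0
    for _ in 1...reps {
        let dZero: Double
        if fused {
            dZero = theC.addingProduct(tempA, theB)  // d = (a * b) + c, one rounding
        } else {
            dZero = (tempA * theB) + theC            // separate multiply then add
        }
        let tempB = (dZero - theC)/theB              // reverse: a = (d - c)/b
        tempA = tempB + theInc                       // increment a for the next pass
    }
    return abs(tempA - (1.0 + Double(reps)))         // difference from the exact result
}

print(cumulativeError(reps: 1_000_000, fused: false))
print(cumulativeError(reps: 1_000_000, fused: true))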

Assembly code for example routines is given in the Appendix at the end, including that generated by the Xcode 13.0 beta 3 (13A5192i) build chain. These were obtained by disassembling an optimised build using Hopper. Timing and cumulative error results obtained from a production M1 Mac mini were analysed using DataGraph.

Error

Lowest cumulative error was obtained throughout by code using separate FMUL-FADD instructions, rather than that using the fused instruction FMADD. For example, with one million iterations, the total cumulative error for FMUL-FADD was 0.000000418 (4.18e-7), and that for FMADD 0.0000259 (2.59e-5), which differ by a factor of over 60. There was a good power-law relationship between cumulative error and the number of iterations, linear on log-log axes, with regressions showing that FMADD error was proportional to the number of loops to the power of 2.048, while FMUL-FADD error was proportional to the number of loops to the power of 1.899. Thus, the more iterations performed, the greater the difference in cumulative error.

If you want to minimise error, don’t use FMADD but separate FMUL and FADD instructions.

Speed

I looked at both head-tested and tail-tested implementations of the conditional branching. Using FMADD with a head test consistently delivered the best performance, and both branching types using FMADD out-performed those using separate FMUL and FADD instructions. With a million iterations, though, the differences were relatively small: relative to the fastest, FMADD with a tail test took 106% of the time, FMUL-FADD with a head test 118%, and compiled Swift 114%.

Performance benefits in using the fused FMADD instruction, or in using head-tested conditional branching, are small.
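For anyone wanting to reproduce rough timings, a simple approach in Swift is sketched below, using the cumulativeError function from the earlier sketch; this isn’t the harness behind these figures, and a single run like this is noisy compared with averaging many:

import Foundation

let start = DispatchTime.now().uptimeNanoseconds
let err = cumulativeError(reps: 1_000_000, fused: true)
let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
print("cumulative error \(err) in \(seconds) s")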

Swift

Compiled Swift code consistently optimises to tail-tested conditional branching using separate FMUL and FADD operations, and doesn’t appear to generate the fused FMADD instruction, despite my phrasing the Swift source to encourage that. This suggests that those responsible for its code generation are aware of how FMADD behaves, particularly its effect on cumulative error.
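One way to verify what your build chain emits, sketched here with loop.swift as a placeholder file name, is to have swiftc write out the optimised assembly and search it for fused instructions:

swiftc -O -emit-assembly loop.swift -o loop.s
grep -i fmadd loop.s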

ARM64 v Intel

I haven’t attempted to look at fused instructions on Intel processors, nor to make any systematic comparison of the same Swift code’s performance on the two platforms. However, considering just the results from one million iterations, the total cumulative error on Intel is the same as that for separate FMUL-FADD instructions on ARM64. Time taken on a 3.2 GHz 8-core Intel Xeon W processor was 0.00774 seconds, 108% of that for Swift on the M1. Yet again, the M1 demonstrates how it matches the performance of much more expensive processors.

Recommendation

If you use other development tools and want to ensure the best results from floating point arithmetic on ARM64, you may wish to check that their code generation doesn’t use fused instructions, particularly in large loops, which can accumulate significant error. It’s worth bearing in mind that authoritative texts on floating-point arithmetic are also extremely cautious about the use of such fused instructions.
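As an example of such a check for C-family tools, clang’s -ffp-contract option controls whether expressions such as (a * b) + c are contracted into fused instructions. This sketch, with source.c as a placeholder, disables contraction and then searches the assembly for any remaining FMADD:

clang -O2 -ffp-contract=off -S source.c -o source.s
grep -i fmadd source.s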

Appendix: Disassembled code

Example of the disassembled FMADD/tail test built in assembly language:
loc_100003838:
fmadd d0, d4, d5, d6 ; d0 = (d4 * d5) + d6, the fused multiply-add
fsub d0, d0, d6 ; reverse the add: d0 = d0 - d6
fdiv d4, d0, d5 ; reverse the multiply: d4 = d0 / d5
fadd d4, d4, d7 ; increment a by the value in d7
subs x4, x4, #0x1 ; decrement the loop counter
b.ne loc_100003838 ; tail test: loop again while the counter is non-zero

Example of the disassembled FMUL-FADD/head test built in assembly language:
loc_100003878:
subs x4, x4, #0x1 ; head test: decrement the loop counter
b.eq loc_100003898 ; exit once the counter reaches zero
fmul d0, d4, d5 ; d0 = d4 * d5
fadd d0, d0, d6 ; d0 = d0 + d6, completing d = (a * b) + c
fsub d0, d0, d6 ; reverse the add: d0 = d0 - d6
fdiv d4, d0, d5 ; reverse the multiply: d4 = d0 / d5
fadd d4, d4, d7 ; increment a by the value in d7
b loc_100003878 ; branch back to the head test
loc_100003898:

Swift source code:
for _ in 1...theReps {
    dZero = (tempA * theB) + theC
    let tempB = ((dZero - theC)/theB)
    tempA = tempB + theInc
}

Disassembled code as generated from Swift by Xcode:
loc_1000042e4:
fmul d4, d11, d0 ; tempA * theB
fadd d4, d4, d1 ; + theC, giving dZero
fadd d4, d4, d3 ; + d3, which appears to hold -theC, so this subtracts theC
fdiv d4, d4, d0 ; / theB
fadd d11, d4, d2 ; + theInc, stored back into tempA
subs x8, x8, #0x1 ; decrement the loop counter
b.ne loc_1000042e4 ; tail test: loop again while the counter is non-zero

Example runtimes in seconds for one million loops:
FMADD/head test 0.00628 s
FMADD/tail test 0.00668 s
FMUL-FADD/head test 0.00744 s
FMUL-FADD/tail test 0.00727 s
Swift 0.00719 s