Are there flaws in some ARM64 instructions?

Floating point maths is a careful compromise between speed and accuracy. One widely used design feature in many processors is the use of fused instructions to perform both multiply and add in one fell swoop, that is to calculate
d = (a * b) + c
in one instruction, known as a fused multiply-add, rather than requiring a multiply instruction followed by a separate add. This has two potential benefits:

  • The intermediate result doesn’t need to be rounded, so the fused instruction gives scope for just a single rounding error rather than two.
  • The instruction can be optimised to reduce processor cycles and improve performance.

In practice, on most general-purpose processors, the greater benefit realised is the reduction in rounding error.
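Here’s a minimal Swift illustration of that single rounding, using contrived values of my own choosing. The standard library’s addingProduct(_:_:) computes a fused multiply-add with one rounding, and on ARM64 should compile down to FMADD:

let a = 100_000_001.0              // 1e8 + 1, exactly representable as a Double
let b = a
let c = -(1.0e16 + 2.0e8)          // the exact product a * b is 1e16 + 2e8 + 1
let separate = (a * b) + c         // the product rounds to 1e16 + 2e8 first, giving 0.0
let fused = c.addingProduct(a, b)  // one rounding of the exact result, giving 1.0

With separate instructions the product loses its final bit to rounding before the add; the fused form keeps it, so the two results here differ by a full 1.0.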

In conjunction with my series here on assembly language programming for the ARM64, I’ve been looking at that processor’s fused multiply-add instruction FMADD, and have some puzzling results to report: so far, it appears that using the FMADD instruction rather than FMUL followed by FADD increases cumulative error, but is slightly faster. State-of-the-art compilers also seem to avoid using FMADD, and opt for separate instructions, suggesting that this may be a known shortcoming in the ARM64 implementation.

To assess this, I’ve been looking at very large numbers of iterative loops involving multiply-add operations. Expressed in Swift, these run through the loop
for _ in 1...theReps {
    dZero = (tempA * theB) + theC      // d = (a * b) + c
    let tempB = ((dZero - theC)/theB)  // reverse: a = (d - c)/b
    tempA = tempB + theInc             // increment a for the next pass
}

This first calculates
d = (a * b) + c
then reverses that calculation using
a = (d - c)/b
which should of course equal the original value of a when the arithmetic is perfectly precise. In the loop, a is then incremented by 1.0 ready for the next iteration, so at the end the value of a should equal its starting value (set by the user) plus the number of iterations. In reality, it accumulates rounding and any other errors incurred in all the floating point arithmetic.
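To make the method concrete, here’s a minimal sketch of such a harness in Swift. This isn’t the code used for the measurements here, and theB, theC and the starting value of a are placeholders; addingProduct(_:_:) is used to request fused semantics, as compiled Swift doesn’t contract the plain expression into FMADD by itself:

func cumulativeError(reps: Int, fused: Bool) -> Double {
    let theB = 3.14159       // placeholder values; the originals are user-set
    let theC = 1.23456
    let theInc = 1.0
    var tempA = 1.0
    for _ in 1...reps {
        let dZero: Double
        if fused {
            dZero = theC.addingProduct(tempA, theB)  // d = (a * b) + c, one rounding
        } else {
            dZero = (tempA * theB) + theC            // separate multiply then add
        }
        let tempB = (dZero - theC)/theB              // reverse: a = (d - c)/b
        tempA = tempB + theInc                       // increment a for the next pass
    }
    return abs(tempA - (1.0 + Double(reps)))         // difference from the exact result
}

print(cumulativeError(reps: 1_000_000, fused: false))
print(cumulativeError(reps: 1_000_000, fused: true))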

Assembly code for example routines is given in the Appendix at the end, including that generated by the Xcode 13.0 beta 3 (13A5192i) build chain. These were obtained by disassembling an optimised build using Hopper. Timing and cumulative error results obtained from a production M1 Mac mini were analysed using DataGraph.

Error

Lowest cumulative error was obtained throughout by code using separate FMUL-FADD instructions, rather than that using the fused instruction FMADD. For example, with one million iterations, the total cumulative error for FMUL-FADD was 0.000000418 (4.18e-7), and that for FMADD 0.0000259 (2.59e-5), which differ by a factor of over 60. There was a good power-law relationship between cumulative error and the number of iterations, linear on log-log axes, with regressions showing that FMADD error was proportional to the number of loops to the power of 2.048, while FMUL-FADD error was proportional to the number of loops to the power of 1.899. Thus, the more iterations performed, the greater the difference in cumulative error.

If you want to minimise error, don’t use FMADD but separate FMUL and FADD instructions.

Speed

I looked at both head-tested and tail-tested implementations of the conditional branching. Using FMADD with a head test consistently delivered the best performance, and both branching types using FMADD out-performed those using separate FMUL and FADD instructions. With a million iterations, though, the differences were relatively small: relative to the fastest, FMADD with a tail test took 106% of the time, FMUL-FADD with a head test 118%, and compiled Swift 114%.

Performance benefits in using the fused FMADD instruction, or in using head-tested conditional branching, are small.
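For anyone wanting to reproduce rough timings, a simple approach in Swift is sketched below, using the cumulativeError function from the earlier sketch; this isn’t the harness behind these figures, and a single run like this is noisy compared with averaging many:

import Foundation

let start = DispatchTime.now().uptimeNanoseconds
let err = cumulativeError(reps: 1_000_000, fused: true)
let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
print("cumulative error \(err) in \(seconds) s")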

Swift

Compiled Swift code consistently optimises to tail-tested conditional branching using separate FMUL and FADD operations, and doesn’t appear to generate the fused FMADD instruction, despite my phrasing the Swift source to encourage that. This suggests that those responsible for its code generation are aware of how FMADD behaves, particularly its effect on cumulative error.
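One way to verify what your build chain emits, sketched here with loop.swift as a placeholder file name, is to have swiftc write out the optimised assembly and search it for fused instructions:

swiftc -O -emit-assembly loop.swift -o loop.s
grep -i fmadd loop.s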

ARM64 v Intel

I haven’t attempted to look at fused instructions on Intel processors, nor to make any systematic comparison of the same Swift code’s performance on the two platforms. However, considering just the results from one million iterations, the total cumulative error on Intel is the same as that for separate FMUL-FADD instructions on ARM64. Time taken on a 3.2 GHz 8-core Intel Xeon W processor was 0.00774 seconds, 108% of that for Swift on the M1. Yet again, the M1 demonstrates how it matches the performance of much more expensive processors.

Recommendation

If you use other development tools and want to ensure the best results from floating point arithmetic on ARM64, you may wish to check that their code generation doesn’t use fused instructions, particularly in large loops, which can accumulate significant error. It’s worth bearing in mind that authoritative texts on floating-point arithmetic are also extremely cautious about the use of such fused instructions.
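As an example of such a check for C-family tools, clang’s -ffp-contract option controls whether expressions such as (a * b) + c are contracted into fused instructions. This sketch, with source.c as a placeholder, disables contraction and then searches the assembly for any remaining FMADD:

clang -O2 -ffp-contract=off -S source.c -o source.s
grep -i fmadd source.s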

Appendix: Disassembled code

Example of the disassembled FMADD/tail test built in assembly language:
loc_100003838:
fmadd d0, d4, d5, d6 ; d0 = (d4 * d5) + d6, the fused multiply-add
fsub d0, d0, d6 ; reverse the add: d0 = d0 - d6
fdiv d4, d0, d5 ; reverse the multiply: d4 = d0 / d5
fadd d4, d4, d7 ; increment a by the value in d7
subs x4, x4, #0x1 ; decrement the loop counter
b.ne loc_100003838 ; tail test: loop again while the counter is non-zero

Example of the disassembled FMUL-FADD/head test built in assembly language:
loc_100003878:
subs x4, x4, #0x1 ; head test: decrement the loop counter
b.eq loc_100003898 ; exit once the counter reaches zero
fmul d0, d4, d5 ; d0 = d4 * d5
fadd d0, d0, d6 ; d0 = d0 + d6, completing d = (a * b) + c
fsub d0, d0, d6 ; reverse the add: d0 = d0 - d6
fdiv d4, d0, d5 ; reverse the multiply: d4 = d0 / d5
fadd d4, d4, d7 ; increment a by the value in d7
b loc_100003878 ; branch back to the head test
loc_100003898:

Swift source code:
for _ in 1...theReps {
    dZero = (tempA * theB) + theC
    let tempB = ((dZero - theC)/theB)
    tempA = tempB + theInc
}

Disassembled code as generated from Swift by Xcode:
loc_1000042e4:
fmul d4, d11, d0 ; tempA * theB
fadd d4, d4, d1 ; + theC, giving dZero
fadd d4, d4, d3 ; + d3, which appears to hold -theC, so this subtracts theC
fdiv d4, d4, d0 ; / theB
fadd d11, d4, d2 ; + theInc, stored back into tempA
subs x8, x8, #0x1 ; decrement the loop counter
b.ne loc_1000042e4 ; tail test: loop again while the counter is non-zero

Example runtimes in seconds for one million loops:
FMADD/head test 0.00628 s
FMADD/tail test 0.00668 s
FMUL-FADD/head test 0.00744 s
FMUL-FADD/tail test 0.00727 s
Swift 0.00719 s