Code in ARM Assembly: Rounding and arithmetic

In the previous article in this series, I looked briefly at floating point formats on the ARM64, its register use and access, and conversions. I now move on to look at scalar arithmetic instructions, beginning with the question of rounding and state.


Unlike integers, floating point numbers can only represent relatively few numbers exactly. The higher the absolute value of any number, the larger the gaps become between those which are represented exactly. All in-between numbers have to be rounded to a close value, and there are several different ways of performing that rounding. IEEE754-2008 offers the choice of:

  • RU, or roundTowardPositive, chooses the next number up towards +∞
  • RD, or roundTowardNegative, chooses the next number down towards –∞
  • RZ, or roundTowardZero, chooses the next number closer to zero
  • roundTiesToEven chooses the nearest number and, when two are equally near, the one whose least significant digit is even
  • roundTiesToAway chooses the nearest number and, when two are equally near, the one whose magnitude is larger.

(IEEE754-2008 has now been superseded by its 2019 revision, which has been adopted as an International Standard at last.)

To illustrate the latter two, when rounding a significand ending in 4565 to one fewer digit, roundTiesToEven would choose that ending in 456, while roundTiesToAway would choose that ending in 457 instead. roundTiesToEven is normally the default for binary values. Although in most cases code leaves the rounding mode at its default, there may be reasons to change it, which is done in the Floating Point Control Register (FPCR). I’ve previously listed variants of the conversion instructions FRINT- and FCVT- which offer different rounding modes to be used in those conversions.
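As a rough illustration of the points above, here is a minimal Python sketch rather than ARM64 code. It assumes Python 3.9 or later for math.ulp, and uses the decimal module's rounding constants as stand-ins for the five IEEE directions. The names gap_at, IEEE_MODES and round_to_3_places are hypothetical helpers invented for this example.

```python
import math
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

def gap_at(x):
    """Spacing between x and the next representable double of larger magnitude."""
    return math.ulp(x)

# Assumed mapping from the IEEE 754 direction names to decimal's constants.
IEEE_MODES = {
    "roundTowardPositive": ROUND_CEILING,
    "roundTowardNegative": ROUND_FLOOR,
    "roundTowardZero": ROUND_DOWN,
    "roundTiesToEven": ROUND_HALF_EVEN,
    "roundTiesToAway": ROUND_HALF_UP,
}

def round_to_3_places(text, mode_name):
    """Round a decimal string to three places using the named direction."""
    return Decimal(text).quantize(Decimal("0.001"),
                                  rounding=IEEE_MODES[mode_name])

if __name__ == "__main__":
    # The gaps widen as the magnitude grows.
    for x in (1.0, 1.0e8, 1.0e16):
        print(f"gap at {x:g} is {gap_at(x):g}")
    # 0.4565 is an exact tie between 0.456 and 0.457.
    for name in IEEE_MODES:
        print(f"{name:20} {round_to_3_places('0.4565', name)}"
              f"  {round_to_3_places('-0.4565', name)}")
```

The tie case only separates the last two modes; the three directed modes differ in how they treat the sign, which is why the sketch also rounds the negative value.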


Although relatively little-used except in specialised code such as maths libraries, the Floating Point Control Register (FPCR) and Floating Point Status Register (FPSR) support options which can be important when dealing with floating point.

The Floating Point Control Register FPCR is a 32-bit register which contains:

  • AHP, the Alternate Half Precision control bit, normally set at 0 to follow the IEEE specification.
  • DN, Default NaN enable, which controls the propagation and return of NaN (Not a Number) values.
  • FZ, Flush-to-Zero enable, which deviates from the IEEE specification by replacing subnormal numbers by zero.
  • RMODE, two control bits which set the rounding mode.
  • FZ16, which controls flush-to-zero mode for half precision format data.

To inspect or change the FPCR, copy it into an X register number n using
MRS Xn, FPCR
To copy it back and set any changes, use
MSR FPCR, Xn

The Floating Point Status Register FPSR is even less used, as most of its flags are concerned with AArch32 mode and numeric comparisons. In ARM64, comparisons set the N, Z, C and V flags in the global condition flag register NZCV instead, as I’ll explain when I look at comparisons. Other FPSR flags are set when saturation, underflow, overflow, division by zero, or an invalid operation has occurred.

To inspect the FPSR, copy it into an X register number n using
MRS Xn, FPSR
and you can copy it back with
MSR FPSR, Xn
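Once the FPSR value is in a general-purpose register, its flags are tested bit by bit. The Python sketch below only models that decoding and isn't ARM64 code. The bit positions and field names are taken from the Arm architecture reference rather than from this article, and decode_fpsr is a hypothetical helper.

```python
# Assumed bit positions of the FPSR flags (Arm architecture reference).
FPSR_FLAGS = {
    "IOC": 0,   # invalid operation
    "DZC": 1,   # division by zero
    "OFC": 2,   # overflow
    "UFC": 3,   # underflow
    "IXC": 4,   # inexact result
    "IDC": 7,   # input denormal
    "QC": 27,   # saturation
}

def decode_fpsr(value):
    """Return the names of the flags that are set in a raw FPSR value."""
    return [name for name, bit in FPSR_FLAGS.items() if (value >> bit) & 1]

if __name__ == "__main__":
    # A made-up value with the division by zero and inexact flags set.
    print(decode_fpsr(0b10010))
```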

Rounding modes available in the FPCR include:

  • round to nearest, 00 in bits 23 and 22
  • roundTowardPositive, 01 in those bits
  • roundTowardNegative, 10 in those bits
  • roundTowardZero, 11 in those bits.
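Changing the rounding mode is a read-modify-write of the FPCR: read it, replace the two RMODE bits, and write it back. The Python sketch below models only the bit manipulation on a plain integer, not the register access itself. The positions assumed for AHP, DN, FZ and FZ16 come from the Arm architecture reference, and decode_fpcr and with_rmode are hypothetical names.

```python
# Assumed single-bit field positions in the FPCR (Arm architecture reference).
FPCR_BITS = {"AHP": 26, "DN": 25, "FZ": 24, "FZ16": 19}

RMODE_SHIFT = 22                    # RMODE occupies bits 23 and 22
RMODE_MASK = 0b11 << RMODE_SHIFT
RMODE_NAMES = {
    0b00: "round to nearest",
    0b01: "roundTowardPositive",
    0b10: "roundTowardNegative",
    0b11: "roundTowardZero",
}

def decode_fpcr(value):
    """Split a raw FPCR value into the fields discussed in the text."""
    fields = {name: (value >> bit) & 1 for name, bit in FPCR_BITS.items()}
    fields["RMODE"] = RMODE_NAMES[(value & RMODE_MASK) >> RMODE_SHIFT]
    return fields

def with_rmode(value, rmode):
    """Return value with its RMODE bits replaced, leaving other bits alone."""
    if rmode not in RMODE_NAMES:
        raise ValueError("rmode must be a two-bit value")
    return (value & ~RMODE_MASK) | (rmode << RMODE_SHIFT)

if __name__ == "__main__":
    fpcr = with_rmode(0, 0b11)      # select roundTowardZero
    print(hex(fpcr), decode_fpcr(fpcr))
```

Clearing the field before setting it matters: simply ORing in new bits would turn 01 or 10 into 11 rather than replace it.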

On Apple platforms, higher-level languages provide access to rounding and other controls for decimal numbers through NSDecimalNumberHandler in Foundation. In Swift, there is language-specific control in the enumeration FloatingPointRoundingRule, which supports

  • awayFromZero
  • down (roundToIntegralTowardNegative)
  • toNearestOrAwayFromZero (roundToIntegralTiesToAway)
  • toNearestOrEven (roundToIntegralTiesToEven, IEEE 754 default)
  • towardZero (roundToIntegralTowardZero)
  • up (roundToIntegralTowardPositive)
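To see how those six rules differ, the following Python sketch mimics them using the decimal module; it is an approximation of the behaviour, not Swift's implementation. The mapping in RULES and the function rounded are assumptions made for this example, and it is only intended for values of modest size that fit within decimal's default precision.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP, ROUND_UP)

# Assumed correspondence between Swift's rule names and decimal's constants.
RULES = {
    "awayFromZero": ROUND_UP,
    "down": ROUND_FLOOR,
    "toNearestOrAwayFromZero": ROUND_HALF_UP,
    "toNearestOrEven": ROUND_HALF_EVEN,
    "towardZero": ROUND_DOWN,
    "up": ROUND_CEILING,
}

def rounded(x, rule):
    """Round a float to an integral value under the named rule."""
    # Decimal(x) converts the binary value exactly, so no extra error creeps in.
    return float(Decimal(x).quantize(Decimal(1), rounding=RULES[rule]))

if __name__ == "__main__":
    for rule in RULES:
        print(f"{rule:24} {rounded(2.5, rule):5} {rounded(-2.5, rule):5}")
```

Values exactly halfway between two integers, such as 2.5 and -2.5, are the ones that distinguish all six rules.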

Arithmetic instructions

Floating point arithmetic instructions are one of the simpler groups of ARM64 instructions, and cover all the expected arithmetic operations including square root, with the addition of some fused instructions which perform combinations of multiplication and addition. Here, I give each instruction using D registers; they can also be used with 32-bit S and 16-bit H registers, but different register sizes can’t be mixed in the same instruction.

Three instructions take the destination register and a single operand:

  • FABS D0, D1 returns the absolute value of the operand D1 in D0
  • FNEG D0, D1 returns the negative value of the operand D1 in D0
  • FSQRT D0, D1 returns the square root of the operand D1 in D0

The bulk of these take the destination register together with two operands:

  • FADD D0, D1, D2 returns the sum of the two operands D1 + D2 in D0
  • FSUB D0, D1, D2 returns the difference of the two operands D1 – D2 in D0
  • FMUL D0, D1, D2 returns the product of multiplying D1 x D2 in D0
  • FNMUL D0, D1, D2 returns the negation of the product of multiplying D1 x D2 in D0
  • FDIV D0, D1, D2 returns the result of dividing D1/D2 in D0
  • FMAX D0, D1, D2 returns the larger of D1 and D2 in D0
  • FMIN D0, D1, D2 returns the smaller of D1 and D2 in D0

Those instructions which fuse multiplication and addition (FMA) take the destination register and three operands, adopting the FMA4 rather than FMA3 pattern. The numbers to be multiplied are given in the first two of the three operands:

  • FMADD D0, D1, D2, D3 first multiplies D1 x D2, then adds D3 to that result, returning the result in D0
  • FMSUB D0, D1, D2, D3 first multiplies D1 x D2, negates that product, then adds D3 to that result, returning the result in D0
  • FNMADD D0, D1, D2, D3 first multiplies D1 x D2, negates that product, then subtracts D3 from that result, returning the result in D0
  • FNMSUB D0, D1, D2, D3 first multiplies D1 x D2, then subtracts D3 from that result, returning the result in D0
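The sign conventions of these four instructions are easy to confuse, so here is a small Python reference model of them. It is a sketch only: it computes the exact result with fractions.Fraction and rounds once to a double, assumes finite inputs and round to nearest, and doesn't model signed zeros, NaNs or infinities. The argument order (n, m, a) follows the operand order D1, D2, D3 used above.

```python
from fractions import Fraction

def _fused(product_sign, addend_sign, n, m, a):
    """Exact product and sum, rounded once to the nearest double."""
    exact = product_sign * Fraction(n) * Fraction(m) + addend_sign * Fraction(a)
    return float(exact)

def fmadd(n, m, a):
    return _fused(+1, +1, n, m, a)    # (n x m) + a

def fmsub(n, m, a):
    return _fused(-1, +1, n, m, a)    # -(n x m) + a

def fnmadd(n, m, a):
    return _fused(-1, -1, n, m, a)    # -(n x m) - a

def fnmsub(n, m, a):
    return _fused(+1, -1, n, m, a)    # (n x m) - a

if __name__ == "__main__":
    for op in (fmadd, fmsub, fnmadd, fnmsub):
        print(op.__name__, op(2.0, 3.0, 4.0))
```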

FMA is a complex area. Fusing the two operations removes one rounding, as the product isn't rounded before the addition, which should improve the accuracy of the result when compared with separate instructions. There are also significant performance gains to be achieved, depending on the implementation. FMA is still poorly supported by optimisations in compilers, where the generation of code using FMA may be relegated to options considered ‘higher risk’. Floating point is sufficiently variable in performance that it may be wisest to measure both the performance and the error of code using separate and fused instructions before deciding which to use in any particular application.
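The effect of removing that one rounding can be shown with a contrived case. This Python sketch emulates a fused multiply-add by computing the exact result with fractions.Fraction and rounding once, and compares it with an ordinary multiply followed by an add. The values are chosen so that the product falls exactly halfway between two doubles; the function names are hypothetical.

```python
from fractions import Fraction

def separate_multiply_add(a, b, c):
    """Multiply, round, then add and round again."""
    return a * b + c

def fused_multiply_add(a, b, c):
    """Exact a * b + c rounded once, as a fused instruction would."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# The exact product is 1 - 2**-54, which is not representable as a double.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

if __name__ == "__main__":
    print("separate:", separate_multiply_add(a, b, c))
    print("fused:   ", fused_multiply_add(a, b, c))
```

With separate operations the product rounds to exactly 1.0 and the small difference is lost altogether, while the fused form keeps it. Real code seldom hits such a clean case, which is one reason to measure the error rather than assume it.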

Today’s cheat sheet provides a simple summary, and is available to tear out as a PDF: arm64fparithmetic1

If you’re wondering what instructions there are for trigonometric functions such as sine, or for other functions such as powers and logarithms, there aren’t any: they’re left to the programmer to address.
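As an idea of what addressing that can involve, here is a minimal Python sketch which approximates sine using nothing but multiplication and addition. It assumes the argument has already been reduced to a small range, roughly within π/4 of zero, and uses a truncated Taylor series evaluated by Horner's scheme; production maths libraries use more carefully chosen coefficients and proper range reduction. The names COEFFS and sin_poly are hypothetical.

```python
import math

# Taylor coefficients of sin(x) for x**15 down to x**3, highest power first.
COEFFS = [(-1.0) ** ((k - 1) // 2) / math.factorial(k)
          for k in (15, 13, 11, 9, 7, 5, 3)]

def sin_poly(x):
    """Approximate sin(x) for small |x| using only multiplies and adds."""
    x2 = x * x
    p = COEFFS[0]
    for coeff in COEFFS[1:]:
        # Each step is a multiply followed by an add, the pattern FMADD fuses.
        p = p * x2 + coeff
    return x + x * x2 * p

if __name__ == "__main__":
    for x in (0.1, 0.5, math.pi / 4):
        print(x, sin_poly(x), math.sin(x))
```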

So far, I have only considered scalar floating point. ARM64 also supports SIMD, in which a single instruction operates on vector data, also known as NEON, which I will try to explain in the next article in this series.

Previous articles in this series:

1: Building an app to develop assembly routines, including an explanation of calling assembly language from Swift, with a complete Xcode project
2: Registers explained
3: Working with pointers
4: Controlling flow
5: Conditional loops
6: Flow, pipelines and performance
7: Moving data around
8: Integer arithmetic
9: Bit operations
10: Conditions without branches
11: Floating point registers and conversions


Register summary
Operand architecture
Conditions and conditional branching instructions
Control Flow
Conditional selection
Instructions for GP registers
Floating point conversions 
Floating point arithmetic (scalar)
AsmAttic 2, a complete Xcode project (version 2)
AsmAttic, a complete Xcode project (version 1)


Procedure Call Standard for the Arm 64-bit Architecture (ARM) from Github
Writing ARM64 Code for Apple Platforms (Apple)
Stephen Smith (2020) Programming with 64-Bit ARM Assembly Language, Apress, ISBN 978 1 4842 5880 4.
Daniel Kusswurm (2020) Modern Arm Assembly Language Programming, Apress, ISBN 978 1 4842 6266 5.
ARM64 Instruction Set Reference (ARM).