Code in ARM Assembly: Floating point registers and conversions

At last, I now move on to look at assembly language for floating point operations on ARM64 processors.

ARM64 processors conform to IEEE 754-2008, and therefore use floating point datatypes which are quite different from integers, as they’re structured into bitfields crossing byte boundaries. Performing bitwise operations or extensions on them therefore makes no sense.


The floating point registers don’t check the format of their contents, and can be used to contain regular binary data. Sometimes there can be speed advantages in using floating point registers to load and store data which isn’t floating point, so don’t be surprised to see these registers being used for other types of data which are then processed using general-purpose registers. This is most likely with operations involving the 128-bit Q registers, which can be a convenient way of working with 128-bit integers. For example,
STR Q1, [X0, X1]
stores the 128-bit integer currently in Q1 at the address formed by adding the offset in X1 to the base address in X0.
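As a rough illustration in C of treating 128 bits as plain data, the sketch below copies a 16-byte value. The type u128_t and the function copy_u128 are names invented here for illustration. Whether a compiler actually uses a Q register for the copy depends on the compiler and its optimisation settings, so inspect the generated assembly rather than assuming it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit integer, held as two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_t;

/* Copy 16 bytes of data that isn't floating point. On ARM64 a compiler
   may implement this with an LDR Qn / STR Qn pair, but that choice is
   left to the compiler. */
void copy_u128(u128_t *dst, const u128_t *src) {
    memcpy(dst, src, sizeof(u128_t));
}
```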

There are 32 floating point registers, numbered from 0 to 31. Each can be accessed in five different sizes: V0 is the generic name for the first floating point register, which can be used as any of the following:

  • Q0 – 128-bit, C __float128
  • D0 – 64-bit, C double and long double, Swift Double, range +/- 2.23 x 10^-308 to +/- 1.80 x 10^308
  • S0 – 32-bit, C float, Swift Float, range +/- 1.17 x 10^-38 to +/- 3.4 x 10^38
  • H0 – 16-bit, C __fp16 or binary16.
  • (B0 – 8-bit, not used in practice.)

Note that ARM64 normally treats C long doubles as 128-bit, but Apple doesn’t, and retains them as 64-bit. For the sake of simplicity here, wherever appropriate I’ll use D versions to work in 64-bit Doubles.
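The sizes and ranges above can be checked from C on any given platform. This is a minimal sketch, assuming an IEEE 754 platform; the function names are invented for illustration, and the value returned for long double is whatever the platform's ABI chooses.

```c
#include <float.h>
#include <stddef.h>

/* Sizes in bytes of the C types that map onto the S and D views. */
size_t float_bytes(void)  { return sizeof(float); }
size_t double_bytes(void) { return sizeof(double); }

/* Size of long double: this varies by platform, which is the point
   of checking it rather than assuming it. */
size_t long_double_bytes(void) { return sizeof(long double); }

/* Smallest positive normalised double, and largest finite double. */
double smallest_normal_double(void) { return DBL_MIN; }
double largest_double(void)         { return DBL_MAX; }
```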


While it’s easy to convert hexadecimal integers and grok them, native floating point format is opaque. Its basic scheme is to approximate each number using the form
m x ß^e
(m times beta to the power of e) where ß is the radix or base, which is a whole number of 2 or greater, m is the significand, whose absolute value is less than ß, and e is the exponent. As floating numbers can always be positive or negative, another key piece of data which has to be encoded is that sign. There’s no such thing as an unsigned floating point number (in this system).

Using this representation, floating point numbers can be normalised, in which their representation obeys the rule that not only is the absolute value of m < ß, but it’s also greater than or equal to 1. When the absolute value of m < 1, the number is said to be subnormal (or denormal). When normalised, the binary representation of every finite non-zero number is unique, which makes most things far simpler.
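To make the distinction concrete, here is a small C sketch that tests whether a double is subnormal by comparing its magnitude with DBL_MIN, the smallest normalised double. The function name is_subnormal is invented for illustration, and the sketch assumes the processor isn't running in a flush-to-zero mode.

```c
#include <float.h>

/* Returns 1 if x is subnormal: non-zero, but smaller in magnitude than
   the smallest normalised double, DBL_MIN. Comparisons with NaN are
   false, so NaN returns 0. */
int is_subnormal(double x) {
    double mag = (x < 0.0) ? -x : x;
    return mag > 0.0 && mag < DBL_MIN;
}
```

Halving DBL_MIN gives a value that is still greater than zero but can no longer be normalised, so is_subnormal(DBL_MIN / 2.0) returns 1.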

In the IEEE 754 binary32 format, corresponding to Floats, the sign is represented by one bit, the exponent in 8 bits, and the significand by the remaining 23 bits. In binary64 format, for Doubles, there’s the single sign bit, an 11-bit exponent, and 52 bits for the significand. There are special representations for zeros and infinities, which are both signed, and for a value ‘Not a Number’, the dreaded NaN, which is generated by invalid operations such as dividing zero by zero. Dividing a non-zero number by zero gives a signed infinity instead.
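The binary64 layout can be explored by copying a double's bits into a 64-bit integer and masking out each field. This is a sketch only; the struct double_fields and the function split_double are names invented here.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned sign;      /* 1 bit: 0 for positive, 1 for negative */
    unsigned exponent;  /* 11 bits, stored with a bias of 1023 */
    uint64_t fraction;  /* 52 bits: the stored part of the significand */
} double_fields;

/* Split a double into its three IEEE 754 binary64 fields. */
double_fields split_double(double x) {
    uint64_t bits;
    double_fields f;
    memcpy(&bits, &x, sizeof bits);   /* copy the raw bits, no conversion */
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

For 1.0 this gives a sign of 0, a stored exponent of 1023 (a true exponent of 0 after removing the bias), and a fraction of 0, because the leading 1 of a normalised significand isn't stored.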

Although these forms allow a huge range of floating point numbers to be represented, relatively few of them are given precisely. A lot of work has gone into improving precision, how results should be rounded to the most appropriate binary value, and how these and other factors contribute to error. I’ll refer to these as I discuss floating point instructions, and the compromises that have to be reached between minimising error and achieving good performance.
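A familiar demonstration of this: sums of binary fractions such as 0.5 and 0.25 are exact, but 0.1, 0.2 and 0.3 have no exact binary representation, so their rounded values don't add up as expected. The helper sum_is_exact below is an invented name, and the sketch assumes arithmetic is performed in IEEE 754 binary64.

```c
/* Returns 1 when a + b, rounded to a double, is exactly equal to
   expected. The volatile store forces the sum to be rounded to double
   precision before the comparison. */
int sum_is_exact(double a, double b, double expected) {
    volatile double sum = a + b;
    return sum == expected;
}
```

Here sum_is_exact(0.5, 0.25, 0.75) returns 1, while sum_is_exact(0.1, 0.2, 0.3) returns 0.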

Passing floating point arguments

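Remember that arguments are allocated to registers according to their register group: integers and addresses go into X registers, and floating point values into D registers, with each group numbered separately. For a C wrapper of
extern double burble(long, double, long, double);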
the first long will be passed in X0, the first double in D0, the second long in X1, and the second double in D1, with the result being returned in D0, as it’s a double.
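For illustration, here is a hypothetical C body for burble, annotated with the registers in which each argument would arrive under the standard ARM64 calling convention. The arithmetic is invented purely to give the function something to do.

```c
/* Hypothetical implementation: the source only gives the declaration.
   On ARM64, a arrives in X0, x in D0, b in X1 and y in D1. */
double burble(long a, double x, long b, double y) {
    /* Each long has to be converted to a double before it can be
       multiplied by one. */
    return (double)a * x + (double)b * y;   /* result returned in D0 */
}
```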

When floating point arguments are passed not by value but by reference, as in
extern void transform(double *, double *, double *);
or in Swift
transform(&x, &y, &z)
the addresses are passed not in floating point D registers, but in X registers (as they’re addresses, not floating point numbers).
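A hypothetical C body for transform shows the same thing from the callee's side. The doubling is invented for illustration; what matters is that the three parameters are addresses, so on ARM64 they arrive in X0, X1 and X2, and the values themselves only reach D registers once they're loaded.

```c
/* Hypothetical implementation: doubles each value in place.
   x, y and z are addresses, passed in X0, X1 and X2. */
void transform(double *x, double *y, double *z) {
    *x *= 2.0;   /* load through the pointer, multiply, store back */
    *y *= 2.0;
    *z *= 2.0;
}
```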

Loading, moving, storing

Loading values into a floating point register, and storing them from a register, use the same families of instructions as for general-purpose registers:

  • LDR and LDP load floating point registers, but don’t have variants to support signed or unsigned data, of course.
  • STR and STP store data from floating point registers.

Memory addresses used in the operands of LDR and STR are naturally given in general-purpose registers.

Moving data between floating point registers, and between a floating point and a general-purpose register, is accomplished using the FMOV instruction. When both the source and destination are floating point registers, they must be of the same size, e.g. both D. The instruction is far more flexible when moving to or from a general-purpose register, when the two registers don’t need to be the same size. However, no conversion, sign- or zero-extension is performed: the raw binary floating point number is moved.
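The C equivalent of such a raw move is copying the bits of a double into an integer of the same size, without converting its value. In this sketch the function names are invented, and the FMOV instructions mentioned in the comments are what a compiler would typically be expected to generate on ARM64, not something guaranteed.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit integer. On ARM64 this
   is typically compiled to something like FMOV Xd, Dn. */
uint64_t double_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* The reverse: treat a 64-bit pattern as a double, typically
   FMOV Dd, Xn. */
double bits_to_double(uint64_t bits) {
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

The bits of 1.0 come out as 0x3FF0000000000000, not as the integer 1, which is why a raw move can't stand in for a conversion.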


There are many instructions which convert values between floating point registers, and between floating point and general-purpose registers. These fall into three groups, depending on the direction of conversion. In each case, they’re used in the format
instruction Am, Bn
where Am is the destination register, and Bn is the source register.

Those which convert from a floating point register to a floating point register include:

  • FCVT and BFCVT, which convert from one size to another, such as between 32-bit S and 64-bit D values, or in the case of BFCVT from a 32-bit S value to BFloat16.
  • FRINT32 and FRINT64, which take a suffix of X or Z, and round to an integral floating point value that fits within a 32-bit or 64-bit integer, in the same size of register.
  • The FRINT family, which round to integral values in H, S or D registers of the same size. These take a suffix to indicate how the rounding is to be performed, for example FRINTI uses the current rounding mode.

Those which convert from a floating point register to an integer in a general-purpose register are largely based on the FCVT family, with instructions constructed using three parts. The first is the prefix FCVT, following which is a letter to set the rounding mode (A, M, N, P or Z, similar to the FRINT family, but a slightly smaller subset), with a suffix of S to generate a signed integer, or U for unsigned. For example, FCVTZS rounds towards zero to give a signed integer. These convert a value in an H, S or D register to an integer in a W or X general-purpose register.

There’s also one special instruction FJCVTZS for converting from a 64-bit floating point D register to a 32-bit signed integer in a general-purpose W register, rounding towards zero, which is primarily intended for use with JavaScript.
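The JavaScript behaviour this instruction serves is the language's ToInt32 conversion: truncate towards zero, then wrap the result modulo 2^32 instead of saturating. The C sketch below models that result only, working on the raw bits. The function name js_to_int32 is invented here, and this isn't an emulation of the instruction itself, as it ignores any effect on flags.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of JavaScript's ToInt32: truncate towards zero, then wrap
   modulo 2^32 into a signed 32-bit integer. */
int32_t js_to_int32(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    int negative  = (int)(bits >> 63);
    int biased    = (int)((bits >> 52) & 0x7FF);
    uint64_t mant = bits & ((UINT64_C(1) << 52) - 1);

    if (biased == 0x7FF) return 0;   /* NaN and infinities give 0 */
    if (biased == 0) return 0;       /* zeros and subnormals truncate to 0 */

    mant |= UINT64_C(1) << 52;       /* restore the implicit leading 1 */
    int shift = biased - 1075;       /* magnitude = mant * 2^shift */

    uint32_t low;
    if (shift >= 32) {
        low = 0;                             /* a multiple of 2^32 */
    } else if (shift >= 0) {
        low = (uint32_t)(mant << shift);     /* only the low 32 bits matter */
    } else if (shift > -53) {
        low = (uint32_t)(mant >> -shift);    /* discard the fractional bits */
    } else {
        low = 0;                             /* magnitude is below 1 */
    }

    if (negative) low = 0u - low;    /* negate modulo 2^32 */

    int32_t result;
    memcpy(&result, &low, sizeof result);
    return result;
}
```

With this, 3.7 becomes 3 and -3.7 becomes -3, while 4294967298.0 (2^32 + 2) wraps around to 2 where a saturating conversion would give the largest 32-bit integer.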

The final group of conversions take an integer in a W or X general-purpose register, and convert it into 16-, 32- or 64-bit floating point form in an H, S or D register, using the rounding mode in the FPCR. SCVTF converts a signed integer, and UCVTF an unsigned one.
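In C, all three groups of conversion are written as casts, and the compiler picks the instruction. The sketch below pairs each cast with the instruction a compiler would typically be expected to use on ARM64; those pairings are an assumption to check against generated assembly, and the function names are invented for illustration.

```c
#include <stdint.h>

/* Floating point to floating point: change of size. */
double widen(float f)   { return (double)f; }   /* typically FCVT Dd, Sn */
float  narrow(double d) { return (float)d; }    /* typically FCVT Sd, Dn */

/* Floating point to integer: C casts always truncate towards zero. */
int64_t  to_signed(double d)   { return (int64_t)d; }    /* typically FCVTZS */
uint64_t to_unsigned(double d) { return (uint64_t)d; }   /* typically FCVTZU */

/* Integer to floating point. */
double from_signed(int64_t n)    { return (double)n; }   /* typically SCVTF */
double from_unsigned(uint64_t n) { return (double)n; }   /* typically UCVTF */
```

Because C casts truncate, to_signed(-2.9) returns -2 and to_signed(2.9) returns 2. The other rounding modes in the FCVT family are reached through library functions or by writing the instruction directly.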

Because of the number and opacity of these instructions, and the fact that they’re not well-covered elsewhere, here’s a graphical summary:


and a tear-out PDF: armfpconversions1

Putting them together

My earlier example of passing arguments and moving data around is a good illustration of the use of these instructions. In C it might be declared as
extern double testadd(double, double*, double*);
which takes one double as a value and two as pointers, and returns a double value. To call that in Swift, use code like
let myA = theA.doubleValue
var myB = theB.doubleValue
let theTemp = theC.doubleValue
var myC = [theTemp, (theTemp + 1.0), (theTemp + 2.0)]
let myD = testadd(myA, &myB, &myC)

This first sets up the three arguments to contain an immutable Double value, a pointer to a Double, and a pointer to a three-element array of Doubles, before calling that function to return a Double result. You then display the results using
self.outputText.string = "Result = \(myD) a = \(myA) b = \(myB) c = \(myC)\n"

My assembly code then reads:
.global _testadd
.align 4

_testadd:
STR LR, [SP, #-16]! // – save the link register on the stack
LDR D5, MULT_TWO
// – that’s a labelled value using PC-relative access
FMUL D6, D0, D5 // – the first argument, a Double, is accessed from the D0 register
LDR D7, [X0] // – that uses base register access. Note that the address of a Double is passed not in a floating point register, but in a general-purpose register.
FMUL D7, D7, D5
STR D7, [X0]
LDR D4, [X1]
FMUL D7, D4, D5
STR D7, [X1]
LDR D4, [X1, #8]!
// – that uses pre-indexing to increment the address in X1 by 8, so accessing the next Double in the array.
FMUL D7, D4, D5
STR D7, [X1]
LDR D4, [X1, #8]!
FMUL D7, D4, D5
STR D7, [X1]
FMOV D0, D6
// – the result is returned as a Double value in the D0 register.
LDR LR, [SP], #16 // – restore the link register
RET

MULT_TWO: .double 2.010203
MULT_THREE: .double 3.020304
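For comparison, here is a rough C equivalent of what that routine does, assuming that the multiplier held in D5 is the MULT_TWO constant. The names testadd_c and kMultTwo are invented here to avoid clashing with the assembly symbols.

```c
/* Assumed multiplier, matching the MULT_TWO constant in the assembly. */
static const double kMultTwo = 2.010203;

/* Multiplies the Double pointed to by b, and each of the three Doubles
   in the array c, by the constant in place, then returns a multiplied
   by the same constant. */
double testadd_c(double a, double *b, double *c) {
    *b *= kMultTwo;
    for (int i = 0; i < 3; i++) {
        c[i] *= kMultTwo;   /* the assembly steps through these by pre-indexing X1 */
    }
    return a * kMultTwo;
}
```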

In the next article, I’ll start to look at those floating point arithmetic instructions in more detail, beginning with the question of rounding and state.

Previous articles in this series:

1: Building an app to develop assembly routines, including an explanation of calling assembly language from Swift, with a complete Xcode project
2: Registers explained
3: Working with pointers
4: Controlling flow
5: Conditional loops
6: Flow, pipelines and performance
7: Moving data around
8: Integer arithmetic
9: Bit operations
10: Conditions without branches


ARM register summary
ARM operand architecture
Conditions and conditional branching instructions
Control Flow
ARM conditional selection
ARM instructions for GP registers
ARM Floating point conversions 
AsmAttic 2, a complete Xcode project (version 2)
AsmAttic, a complete Xcode project (version 1)

