At last, I now move on to look at assembly language for floating point operations on ARM64 processors.

ARM64 processors conform to IEEE754-2008, so their floating point datatypes are quite different from integers: they’re structured into bitfields crossing byte boundaries. Performing bitwise operations or sign- and zero-extensions on them therefore makes no sense.

**Registers**

The floating point registers don’t check the format of their contents, and can be used to contain regular binary data. Sometimes there can be speed advantages in using floating point registers to load and store data which isn’t floating point, so don’t be surprised to see these registers being used for other types of data which are then processed using general-purpose registers. This is most likely with operations involving the 128-bit Q registers, which can be a convenient way of working with 128-bit integers. For example,

`STR Q1, [X0, X1]`

stores the 128-bit value currently in Q1 to memory, at the address given by adding the offset in X1 to the base address in X0.
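As a rough illustration of that point in C, the sketch below copies a 16-byte block of non-floating-point data with `memcpy`. On ARM64, a compiler may choose to implement a fixed 16-byte copy like this with a load and store of a Q register, although that depends on the compiler and its optimisation settings. The type `Block128` and the function `copy_block128` are hypothetical names used only for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit container: two 64-bit halves, 16 bytes in total. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} Block128;

/* Copy 16 bytes of integer data. An ARM64 compiler is free to move them
   through a Q register rather than through two X registers. */
void copy_block128(Block128 *dst, const Block128 *src) {
    memcpy(dst, src, sizeof(Block128));
}
```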

There are 32 floating point registers, numbered from 0 to 31. Each can be accessed in five different sizes: V0 is the generic name for the first floating point register, which can be used as any of the following:

- Q0 – 128-bit, C __float128
- D0 – 64-bit, C double and long double, Swift Double, range +/- 2.23 x 10^-308 to +/- 1.80 x 10^308
- S0 – 32-bit, C float, Swift Float, range +/- 1.17 x 10^-38 to +/- 3.4 x 10^38
- H0 – 16-bit, C __fp16 or binary16.
- (B0 – 8-bit, not used in practice.)

Note that ARM64 normally treats C long doubles as 128-bit, but Apple doesn’t, and retains them as 64-bit. For the sake of simplicity here, wherever appropriate I’ll use D versions to work in 64-bit Doubles.
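To check these sizes and ranges on a particular platform, a minimal C sketch like the following could be used. It assumes 8-bit bytes, and the function names are hypothetical. The value returned for `long double` is the one that differs between Apple’s platforms and other ARM64 systems.

```c
#include <float.h>
#include <stddef.h>

/* Size in bits of C's float, which occupies an S register. */
size_t float_bits(void) { return sizeof(float) * 8; }

/* Size in bits of C's double, which occupies a D register. */
size_t double_bits(void) { return sizeof(double) * 8; }

/* Size in bits of long double: this is the platform-dependent one. */
size_t long_double_bits(void) { return sizeof(long double) * 8; }

/* Smallest normalised and largest finite positive doubles, from <float.h>. */
double smallest_normal_double(void) { return DBL_MIN; }
double largest_double(void) { return DBL_MAX; }
```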

**Formats**

While it’s easy to convert hexadecimal integers and grok them, native floating point format is opaque. Its basic scheme is to approximate each number using the form

`m x ß^e`

(m times beta to the power of e) where ß is the **radix** or base, which is a whole number of 2 or greater, `m` is the **significand**, whose absolute value is less than ß, and `e` is the **exponent**. As floating point numbers can always be positive or negative, another key piece of data which has to be encoded is the **sign**. There’s no such thing as an unsigned floating point number (in this system).

Using this representation, floating point numbers can be **normalised**, in which their representation obeys the rule that not only is the absolute value of `m` < ß, but it’s also greater than or equal to 1. When the absolute value of `m` < 1, the number is said to be **subnormal** (or denormal). When normalised, the binary representation of every finite non-zero number is unique, which makes most things far simpler.
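For a radix of 2, that decomposition can be sketched in C using the standard library’s `frexp()`. The function names here are hypothetical. Note that `frexp()` returns a fraction between 0.5 and 1, so the sketch rescales it by one binary place to obtain a significand between 1 and 2.

```c
#include <math.h>

/* Split a finite, non-zero double into a significand m with 1 <= |m| < 2
   and a binary exponent e, so that x == m * 2^e. */
double normalised_significand(double x, int *e) {
    int exp2;
    double frac = frexp(x, &exp2);   /* x == frac * 2^exp2, 0.5 <= |frac| < 1 */
    *e = exp2 - 1;                   /* one place fewer in the exponent ... */
    return frac * 2.0;               /* ... and one more in the significand */
}

/* Report whether x is too small to be stored in normalised form. */
int is_subnormal(double x) {
    return fpclassify(x) == FP_SUBNORMAL;
}
```

With this, 10.0 decomposes into a significand of 1.25 and an exponent of 3, as 1.25 x 2^3 = 10.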

In the IEEE754 binary32 format, corresponding to Floats, the sign is represented by one bit, the exponent in 8 bits, and the significand by the remaining 23 bits. In binary64 format, for Doubles, there’s the single sign bit, an 11 bit exponent, and 52 bits for the significand. There are special representations for zeros and infinities, which are both signed, and for a value ‘Not a Number’, the dreaded NaN, which is generated by invalid operations such as dividing zero by zero (dividing a non-zero number by zero gives a signed infinity instead).
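The binary32 layout can be explored with a short C sketch that copies a `float` into a 32-bit integer and masks out each field. The struct and function names are hypothetical. The exponent is shown as stored, which in binary32 means with a bias of 127 added.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE754 binary32 value. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits of significand */
} Binary32Fields;

/* Copy the raw bits of a float and separate its fields. */
Binary32Fields split_binary32(float f) {
    uint32_t bits;
    Binary32Fields out;
    memcpy(&bits, &f, sizeof bits);
    out.sign = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}

/* Division performed at run time, so a zero divisor isn't folded away. */
float divide(volatile float a, volatile float b) {
    return a / b;
}
```

Infinities and NaNs both store an exponent field of all ones (255). An infinity has a zero fraction, and a NaN a non-zero one.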

Although these forms allow a huge range of floating point numbers to be represented, relatively few of them are given precisely. A lot of work has gone into improving precision, how results should be rounded to the most appropriate binary value, and how these and other factors contribute to error. I’ll refer to these as I discuss floating point instructions, and the compromises that have to be reached between minimising error and achieving good performance.
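A familiar way to see this lack of precision is to add 0.1 and 0.2 as Doubles, neither of which has an exact binary representation. This is a minimal C sketch with hypothetical function names. It assumes that arithmetic is carried out in 64-bit doubles, as it is on ARM64.

```c
/* Add two doubles at run time; volatile stops the compiler folding the sum. */
double add_doubles(volatile double a, volatile double b) {
    return a + b;
}

/* Return 1 if a + b gives exactly the double closest to 'expected'. */
int sum_is_exact(double a, double b, double expected) {
    return add_doubles(a, b) == expected;
}
```

Here `sum_is_exact(0.1, 0.2, 0.3)` returns 0, because the rounded sum isn’t the same Double as the one nearest 0.3, whereas `sum_is_exact(0.5, 0.25, 0.75)` returns 1, as all three values are exact in binary.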

**Passing floating point arguments**

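Remember that arguments are allocated to registers according to their type: integers and addresses go in the general-purpose X registers, floating point values in the floating point registers, and each group is numbered separately from 0. For a C wrapper of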

`extern double burble(long, double, long, double)`

the first long will be passed in X0, the first double in D0, the second long in X1, and the second double in D1, with the result being returned in D0, as it’s a double.

When floating point arguments are passed not by value but by reference, as in

`extern void transform(double*, double*, double*)`

or in Swift

`transform(&x, &y, &z)`

the addresses are passed not in floating point D registers, but in X registers (as they’re addresses, not floating point numbers).
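To make those two cases concrete, here are C sketches of functions with the same signatures. Their bodies are invented purely for illustration. The comments give the registers in which each argument would arrive under the ARM64 procedure call standard.

```c
/* Hypothetical body for burble(): a arrives in X0, x in D0, b in X1 and
   y in D1, and the double result is returned in D0. */
double burble(long a, double x, long b, double y) {
    return (double)a * x + (double)b * y;
}

/* Hypothetical body for transform(): the three addresses arrive in X0, X1
   and X2, and the doubles themselves are loaded from and stored to memory. */
void transform(double *x, double *y, double *z) {
    *x = *x * 2.0;
    *y = *y * 2.0;
    *z = *z * 2.0;
}
```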

**Loading, moving, storing**

Loading values into a floating point register, and storing them from a register, use the same families of instructions as for general-purpose registers:

- `LDR` and `LDP` load floating point registers, but don’t have variants to support signed or unsigned data, of course.
- `STR` and `STP` store data from floating point registers.

Memory addresses used in the operands of LDR and STR are naturally given in general-purpose registers.

Moving data between floating point registers, and between a floating point and a general-purpose register, is accomplished using the `FMOV` instruction. When both the source and destination are floating point registers, they must be of the same size, e.g. both D. The instruction is far more flexible when moving to or from a general-purpose register, when the two registers don’t need to be the same size. However, no conversion, sign- or zero-extension is performed: the raw binary floating point number is moved.
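The difference between moving the raw bits and converting the value can be sketched in C. The function names are hypothetical. Copying the bytes with `memcpy` corresponds to what `FMOV` does between a D and an X register, while a cast changes the representation.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 64 bits of a double unchanged, without interpreting them. */
uint64_t raw_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Convert the numeric value to an integer instead, truncating towards zero. */
int64_t converted_value(double d) {
    return (int64_t)d;
}
```

For the Double 1.0, `raw_bits` returns 0x3FF0000000000000, while `converted_value` returns 1.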

**Conversions**

There are many instructions which convert the contents of floating point registers to and from floating point and general-purpose registers. These fall into three groups, depending on the direction of conversion. In each case, they’re used in the format

`INSTR Am, Bn`

where Am is the destination register, and Bn is the source register.

Those which convert from a floating point register to a floating point register include:

- `FCVT` and `BFCVT`, which convert from one size to another, such as 32-bit S and 64-bit D values.
- `FRINT32` and `FRINT64`, which take a suffix of X (use the current rounding mode) or Z (round towards zero), and round to an integral value that fits a 32-bit or 64-bit integer, within the same size of register.
- The `FRINT` family, which round to integrals between H, S or D registers of the same size. These take a suffix to indicate how the rounding is to be performed, for example `FRINTI` uses the current rounding mode.

Those which convert from a floating point register to an integer in a general-purpose register are largely based on the `FCVT` family, with instructions constructed using three parts. The first is the prefix `FCVT`, following which is a letter to set the rounding mode (a slightly smaller subset of those used when rounding to integrals), with a suffix of `S` to generate a signed integer, or `U` for unsigned. For example, `FCVTZS` rounds towards zero to give a signed integer. These convert a value in an H, S or D register to an integer in a W or X general-purpose register.

There’s also one special instruction `FJCVTZS` for converting from a 64-bit floating point D register to a 32-bit signed integer in a general-purpose W register, rounding towards zero, which is primarily intended for use with JavaScript.

The final group of conversions take an integer in a W or X general-purpose register, and convert it into 16-, 32- or 64-bit floating point form in an H, S or D register, using the rounding mode in the FPCR. `SCVTF` converts a signed integer, and `UCVTF` an unsigned one.
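In C, the most common of these conversions are written as casts. The sketch below uses hypothetical function names. The instructions named in its comments are those an ARM64 compiler would typically be expected to choose, rather than anything guaranteed. It assumes the default round-to-nearest mode when converting integers to floating point.

```c
#include <stdint.h>

/* double to signed integer: a C cast truncates towards zero, like FCVTZS. */
int64_t double_to_signed(double d) { return (int64_t)d; }

/* double to unsigned integer, again towards zero, like FCVTZU. */
uint64_t double_to_unsigned(double d) { return (uint64_t)d; }

/* Signed and unsigned integer to double, like SCVTF and UCVTF. Integers
   too large for the 53-bit significand are rounded. */
double signed_to_double(int64_t i) { return (double)i; }
double unsigned_to_double(uint64_t u) { return (double)u; }

/* Change of floating point size, like FCVT. */
double float_to_double(float f) { return (double)f; }
float double_to_float(double d) { return (float)d; }
```

Widening a Float to a Double is always exact, but it doesn’t recover precision that was never there: `float_to_double(0.1f)` isn’t equal to the Double 0.1.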

Because of the number and opacity of these instructions, and the fact that they’re not well covered elsewhere, I’ve summarised them graphically in a tear-out PDF: armfpconversions1

**Putting them together**

My earlier example of passing arguments and moving data around is a good illustration of the use of these instructions. In C it might be declared as

`extern double testadd(double, double*, double*);`

which takes one double as a value and two as pointers, and returns a double value. To call that in Swift, use code like

```swift
let myA = theA.doubleValue
var myB = theB.doubleValue
let theTemp = theC.doubleValue
var myC = [theTemp, (theTemp + 1.0), (theTemp + 2.0)]
let myD = testadd(myA, &myB, &myC)
```

This first sets up the three arguments to contain an immutable Double value, a pointer to a Double, and a pointer to a three-element array of Doubles, before calling that function to return a Double result. You then display the results using

`self.outputText.string = "Result = \(myD) a = \(myA) b = \(myB) c = \(myC)\n"`

My assembly code then reads:

```asm
    .global _testadd
    .align 4
_testadd:
    STR LR, [SP, #-16]!
    LDR D5, MULT_TWO
    // that is a labelled value using PC-relative access
    FMUL D6, D0, D5
    // the first argument, a Double, is accessed from the D0 register
    LDR D7, [X0]
    // that uses base register access. Note that the address of a Double is
    // passed not in a floating point register, but in a general-purpose register.
    FMUL D7, D7, D5
    STR D7, [X0]
    LDR D5, MULT_THREE
    LDR D4, [X1]
    FMUL D7, D4, D5
    STR D7, [X1]
    LDR D4, [X1, #8]!
    // that uses pre-indexing to increment the address in X1 by 8, so
    // accessing the next Double in the array.
    FMUL D7, D4, D5
    STR D7, [X1]
    LDR D4, [X1, #8]!
    FMUL D7, D4, D5
    STR D7, [X1]
    FMOV D0, D6
    // the result is returned as a Double value in the D0 register.
    LDR LR, [SP], #16
    RET
MULT_TWO: .double 2.010203
MULT_THREE: .double 3.020304
```
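For comparison, the same behaviour can be sketched in C. This is my reading of what the assembly routine does, not code from the project, and the name `testadd_c` is hypothetical, chosen so that it can’t clash with the assembly version.

```c
/* C sketch of the assembly routine: scales the Double passed by reference,
   scales each of the three array elements, and returns the scaled value. */
double testadd_c(double a, double *b, double *c) {
    const double mult_two = 2.010203;
    const double mult_three = 3.020304;

    *b *= mult_two;                 /* the Double whose address is in X0 */
    for (int i = 0; i < 3; i++) {   /* the three-element array at X1 */
        c[i] *= mult_three;
    }
    return a * mult_two;            /* returned to the caller in D0 */
}
```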

In the next article, I’ll start to look at those floating point arithmetic instructions in more detail, beginning with the question of rounding and state.

**Previous articles in this series:**

1: Building an app to develop assembly routines, including an explanation of calling assembly language from Swift, with a complete Xcode project

2: Registers explained

3: Working with pointers

4: Controlling flow

5: Conditional loops

6: Flow, pipelines and performance

7: Moving data around

8: Integer arithmetic

9: Bit operations

10: Conditions without branches

**Downloads:**

ARM register summary

ARM operand architecture

Conditions and conditional branching instructions

Control Flow

ARM conditional selection

ARM instructions for GP registers

ARM Floating point conversions

AsmAttic 2, a complete Xcode project (version 2)

AsmAttic, a complete Xcode project (version 1)
