How ARM64 uses its special SIMD registers in lanes, and how they can be loaded with and without de-interleaving.
Benchmarking 32-bit Float vector dot-product calculations using Swift, NEON assembly, and Apple’s SIMD libraries, on Intel and M1 Macs.
Details the options available for rounding floating-point numbers, and covers all the scalar floating-point operations. There’s another cheat sheet summary too.
Floating point numbers are very different from integers, but are loaded and stored much the same. Conversion between registers, including to and from integers, is complex.
Where code can make simple selections according to a conditional test, it may be possible to eliminate branching and ensure rapid execution.
Many processors, including the ARM64, have instructions to perform fused multiply-add operations. Do they deliver reduced error and better performance?
An overview of bit operations, including MOVK for 16-bit immediate values, bit shift operations, bitwise AND, OR, XOR, and more, plus a cheat sheet.
Basic integer arithmetic – add, subtract, negate, multiply, multiply-and-add, and divide – in their many variations, with some catches for those more used to high-level languages.
Explaining the LDR family of instructions for loading registers, MOV for moving one register to another, STR for storing to memory, and SXTx/UXTx for filling a register with smaller data types.
How conditional branching can slow modern processors down badly, comparing assembly code with that generated by Apple’s Swift compiler, and some puzzles.