The cores in the M1 and the chip itself are thoroughly Apple designs, and work hand-in-glove with macOS using techniques like out-of-order execution and hints to optimise performance.
ARM
Summary and links for the latest information about what’s in the current M1 chip, from differences in caches between cores, to the Matrix Coprocessor and Fabric limitations.
How ARM64 uses its special SIMD registers in lanes, and how they can be loaded with and without de-interleaving.
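As a rough sketch of the difference, in Swift rather than assembly (the sample values and the strided copies are mine; in ARM64 assembly a single LD2 does the splitting in one instruction):

```swift
// Interleaved pairs x0,y0,x1,y1,… such as a stream of coordinates (hypothetical values).
let interleaved: [Float] = [1, 10, 2, 20, 3, 30, 4, 40]

// A plain load fills one register's four lanes in memory order, as LD1 does:
let plain = SIMD4<Float>(interleaved[0...3])                                  // (1, 10, 2, 20)

// A de-interleaving load separates the two streams into different registers, as LD2 does:
let xs = SIMD4<Float>(stride(from: 0, to: 8, by: 2).map { interleaved[$0] })  // (1, 2, 3, 4)
let ys = SIMD4<Float>(stride(from: 1, to: 8, by: 2).map { interleaved[$0] })  // (10, 20, 30, 40)

print(plain, xs, ys)
```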
Three recent WWDC sessions extol Apple’s “extensive reference material”, yet Xcode can’t find anything on these rich and extensive libraries.
More cores are great for running more processes, but how can you make individual operations within a process faster? SIMD is one solution.
Benchmarking 32-bit Float vector dot-product calculations using Swift, NEON assembly, and Apple’s SIMD libraries, on Intel and M1 Macs.
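For a rough idea of the kind of calculation being benchmarked, here is a minimal Swift sketch using the standard library’s SIMD4<Float>; the function name and the assumption that the length is a multiple of four are mine, not the article’s.

```swift
// Dot product of two Float arrays, four lanes at a time (length assumed to be a multiple of 4).
func dotProduct(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && a.count % 4 == 0)
    var acc = SIMD4<Float>(repeating: 0)
    for i in stride(from: 0, to: a.count, by: 4) {
        let va = SIMD4<Float>(a[i ..< i + 4])
        let vb = SIMD4<Float>(b[i ..< i + 4])
        acc.addProduct(va, vb)      // multiply-accumulate in all four lanes at once
    }
    return acc.sum()                // horizontal reduction of the four partial sums
}

// let result = dotProduct([1, 2, 3, 4], [5, 6, 7, 8])   // 70
```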
Details the options available for rounding floating point numbers, and all the scalar floating point operations. There’s another cheat sheet summary too.
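For instance, Swift’s FloatingPointRoundingRule offers the same set of choices; the ARM64 mnemonics in the comments are the instructions that correspond to each rule.

```swift
let x: Float = 2.5
let away  = x.rounded(.toNearestOrAwayFromZero)   // 3.0  FRINTA: ties away from zero
let even  = x.rounded(.toNearestOrEven)           // 2.0  FRINTN: ties to even, the IEEE 754 default
let down  = x.rounded(.down)                      // 2.0  FRINTM: towards −∞
let up    = x.rounded(.up)                        // 3.0  FRINTP: towards +∞
let trunc = x.rounded(.towardZero)                // 2.0  FRINTZ: truncation
print(away, even, down, up, trunc)
```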
Floating point numbers are very different from integers, but are loaded and stored much the same. Conversion between registers, including to and from integers, is complex.
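A small Swift illustration of the conversions (the values are arbitrary; the mnemonics in the comments are the usual ARM64 conversion instructions, though what the compiler actually emits may differ):

```swift
let f: Float = -3.7
let truncated = Int32(f)                           // -3: Float to Int truncates towards zero (FCVTZS)
let nearest   = Int32(f.rounded(.toNearestOrEven)) // -4: round first if another rule is wanted
let back      = Float(truncated)                   // -3.0: Int to Float (SCVTF), exact for small values
print(truncated, nearest, back)
```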
Where code can make simple selections according to a conditional test, it may be possible to eliminate branching and ensure rapid execution.
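A Swift sketch of the idea: the scalar ternary below is the kind of pattern the compiler can lower to a conditional select (CSEL, or FCSEL for floats) rather than a branch, and the SIMD form selects per lane without any branching; clampLow is an invented name for illustration.

```swift
// Scalar: a simple conditional selection, a candidate for FCSEL rather than a branch.
func clampLow(_ x: Float, to floor: Float) -> Float {
    x < floor ? floor : x
}
let clamped = clampLow(-0.5, to: 0)   // 0.0

// SIMD: replace negative lanes with zero under a mask, with no branching at all.
var v = SIMD4<Float>(-1, 2, -3, 4)
v.replace(with: 0, where: v .< SIMD4<Float>(repeating: 0))
print(clamped, v)                     // 0.0 SIMD4<Float>(0.0, 2.0, 0.0, 4.0)
```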
Many processors like the ARM64 have instructions to perform fused multiply-add operations. Do they deliver reduced error and better performance?
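One quick way to see the reduced error, in Swift: fma() performs the multiply and add with a single rounding (FMADD on ARM64), so it can recover the bits an ordinary product throws away. The 1/3 example is mine, not taken from the article.

```swift
import Foundation

let x = 1.0 / 3.0
let p = x * 3.0               // separate multiply then round: exactly 1.0
let err = fma(x, 3.0, -p)     // fused multiply-add recovers the rounding error the product lost
print(p, err)                 // 1.0 -5.551115123125783e-17
```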