Some apps and other code don’t appear to run any faster on M1 chips, and some even run more slowly. Could this be because they aren’t using the best acceleration for vectors and matrices?
NEON
How ARM64 uses its special SIMD registers in lanes, and how they can be loaded with and without de-interleaving.
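
As a rough illustration of the two kinds of load, here is a minimal sketch in C. It assumes interleaved RGB bytes as the example data, and the function names `load_plain` and `split_rgb` are hypothetical, chosen only for this sketch. On ARM64 it uses the `vld1q_u8` and `vld3q_u8` intrinsics from `arm_neon.h`; elsewhere it falls back to a scalar loop that models the same lane layout.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* One 128-bit SIMD register viewed as 16 lanes of 8 bits each. */
#define LANES 16

/* Plain load: 16 consecutive bytes go into the 16 lanes in memory order. */
void load_plain(const uint8_t *src, uint8_t *lanes)
{
#if defined(__ARM_NEON)
    uint8x16_t v = vld1q_u8(src);   /* load without de-interleaving */
    vst1q_u8(lanes, v);
#else
    /* Scalar model of the same load, for non-ARM builds. */
    for (int lane = 0; lane < LANES; lane++) {
        lanes[lane] = src[lane];
    }
#endif
}

/* De-interleaving load: 48 bytes of R,G,B,R,G,B,... are split so that
   each of three registers holds the 16 values of a single channel. */
void split_rgb(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b)
{
#if defined(__ARM_NEON)
    uint8x16x3_t v = vld3q_u8(rgb); /* load with 3-way de-interleaving */
    vst1q_u8(r, v.val[0]);
    vst1q_u8(g, v.val[1]);
    vst1q_u8(b, v.val[2]);
#else
    /* Scalar model: lane i of each register takes every third byte. */
    for (int lane = 0; lane < LANES; lane++) {
        r[lane] = rgb[3 * lane + 0];
        g[lane] = rgb[3 * lane + 1];
        b[lane] = rgb[3 * lane + 2];
    }
#endif
}
```

With the de-interleaved layout, one instruction can then operate on all 16 red values at once, which is the usual reason for choosing that form of load over the plain one.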