Over the last nine months, a great deal of work has gone into discovering just what is in Apple’s M1 chip, and what it all does. As Apple prepares to announce its successor in the next few weeks, I thought it might be worth surveying the work which has been accomplished so far, to establish the baseline for future Apple Silicon chips.
The M1 chip combines many of the features which have previously been implemented in multiple chips, from the CPU and GPU, which have commonly come from different designers, to all the specialist interfaces for peripherals such as storage controllers. Main memory isn’t part of the M1 chip itself, though it is contained within the same package. The M1 is also unusual in incorporating many coprocessors in addition to the GPU, which provide state-of-the-art facilities for features such as deep learning using neural networks.
This contrasts with Intel’s Northbridge and Southbridge chipsets, in which Northbridge is the memory controller hub interfacing with the CPU, main memory and PCI-Express video cards, and Southbridge is the platform controller hub interfacing with slower and peripheral systems. In Sandy Bridge, Northbridge is integrated with the CPU in a SoC of more modest aspirations.
At its heart, each M1 chip has a total of eight processor cores, all based on Apple’s development of technology licensed from Arm. Four are described as Performance cores, dubbed Firestorm, and four are Efficiency cores, or Icestorm. These primarily differ in their compromise between performance and power consumption, with Firestorm cores performing in the same class as better Intel cores, and Icestorm delivering lower performance while drawing much less power and producing less heat. They also differ in their provision of cache memory:
- Firestorm has 192+128 KB L1 cache per core, and shares a total of 12 MB L2 cache.
- Icestorm has 128+64 KB L1 cache per core, and shares a total of 4 MB L2 cache.
Both types of core support the ARMv8.4-A instruction set, with Neon (128-bit register) SIMD instructions for parallel processing of both integer and floating-point operations. However, they don’t support Arm’s Scalable Vector Extension (SVE) SIMD instructions. Each core features 8-wide decode to multiple integer execution units and four floating-point/SIMD units. Integer and floating-point operations, including SIMD, are fully supported in compiled high-level and assembly languages.
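To make the 128-bit Neon register width more concrete, here is a minimal sketch of how many elements fit into one register at different element widths. The element types listed are chosen for illustration, and the function name is hypothetical; it is simple arithmetic, not a description of how the cores schedule SIMD work.

```python
# Neon vector registers are 128 bits wide (figure taken from the text above).
NEON_REGISTER_BITS = 128


def lanes_per_register(element_bits: int) -> int:
    """Return how many elements of the given width fit in one Neon register."""
    if element_bits <= 0 or NEON_REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must divide the register width evenly")
    return NEON_REGISTER_BITS // element_bits


# Element widths picked purely as examples.
for name, bits in [("int8", 8), ("int16", 16), ("float32", 32), ("float64", 64)]:
    print(f"{name}: {lanes_per_register(bits)} lanes")
```

A single SIMD instruction can therefore operate on, for example, four 32-bit floating-point values at once.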
Ordinarily, apps don’t get to choose which type of core their processes run on. For background tasks, though, the code can choose a Quality of Service (QoS) which determines how macOS allocates them to cores. Most macOS background tasks and services, such as making Time Machine backups, are allocated exclusively to the Efficiency cores, leaving the Performance cores free for user tasks. Such asymmetric multiprocessing is characteristic of the M1, and seldom seen elsewhere.
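As a rough illustration of how code can request a QoS, the sketch below calls the `pthread_set_qos_class_self_np` function from macOS's system library through Python's `ctypes`. The wrapper name `set_current_thread_qos` is hypothetical, the constants are assumed to match Apple's `<sys/qos.h>` header, and most apps would normally set QoS through higher-level APIs such as Grand Central Dispatch instead. The request only expresses a preference: which cores the thread actually runs on remains up to macOS.

```python
import ctypes
import sys

# QoS class values, assumed to match those in Apple's <sys/qos.h> header.
QOS_CLASS_USER_INTERACTIVE = 0x21
QOS_CLASS_USER_INITIATED = 0x19
QOS_CLASS_UTILITY = 0x11
QOS_CLASS_BACKGROUND = 0x09


def set_current_thread_qos(qos_class: int) -> bool:
    """Ask macOS to run the calling thread at the given QoS class.

    Returns True on success, and False if the call fails or the
    function isn't available (for example, when not running on macOS).
    """
    if sys.platform != "darwin":
        return False
    try:
        libsystem = ctypes.CDLL(None)  # symbols already loaded in this process
        func = libsystem.pthread_set_qos_class_self_np
    except (OSError, AttributeError):
        return False
    func.argtypes = [ctypes.c_uint, ctypes.c_int]
    func.restype = ctypes.c_int
    # The second argument is the relative priority within the QoS class.
    return func(qos_class, 0) == 0


if __name__ == "__main__":
    ok = set_current_thread_qos(QOS_CLASS_BACKGROUND)
    print("background QoS requested:", ok)
```

A thread which requests background QoS in this way is the kind of work that macOS would be expected to allocate to the Efficiency cores.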
Documentation is extensive, including:
- Andrei Frumusanu on AnandTech
- ARMv8 Instruction Set Overview (Arm, PDF)
- Arm A64 Instruction Set Architecture (Arm)
- Arm Architecture Reference Manual (Arm)
- Neon Programmer’s Guide (Arm)
- Dougall Johnson’s detailed survey of the cores and their instructions.
The GPU consists of eight cores (seven in cheaper models), each with 16 Execution Units, each of which has 8 Arithmetic Logic Units, making a total of 1024 ALUs in the eight-core version, which are capable of executing 24,576 threads simultaneously. Access to the GPU is provided to third-parties through calls in the Metal and related frameworks.
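The arithmetic behind those GPU figures can be checked with a short sketch. The per-core numbers come from the text; the function name is hypothetical, and the threads-per-ALU ratio is simply derived from the quoted totals rather than being a documented figure.

```python
# Figures taken from the text above.
EXECUTION_UNITS_PER_CORE = 16
ALUS_PER_EXECUTION_UNIT = 8
THREADS_EIGHT_CORE_GPU = 24_576


def total_alus(gpu_cores: int) -> int:
    """Total ALUs in a GPU with the given number of cores."""
    return gpu_cores * EXECUTION_UNITS_PER_CORE * ALUS_PER_EXECUTION_UNIT


print(total_alus(8))  # 1024, the full eight-core GPU
print(total_alus(7))  # 896, the seven-core GPU in cheaper models

# Derived ratio only: simultaneous threads per ALU in the eight-core GPU.
print(THREADS_EIGHT_CORE_GPU // total_alus(8))  # 24
```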
Central to the M1 design model is the use of a common pool of memory, which Apple terms Unified memory, by the CPU cores, the GPU, and some of the coprocessors. In the M1, the memory is LPDDR4X SDRAM in either 8 or 16 GB configurations, seen to the right of the M1 in the image above, and cannot be expanded. Unified memory eliminates the need to transfer data between main (CPU) memory and that provided for coprocessors, such as video memory in a graphics card. This is a reversal of the previous pattern, in which main memory was only used by low-end GPUs, but follows a recent trend relying on fast-access shared memory and tight integration of the GPU.
Although Apple has revealed some of the specialist coprocessors in the M1, most are undocumented, and can’t be directly accessed by engineers other than Apple’s.
Best-known is Apple’s Neural Engine (ANE), described as containing 16 ‘cores’ which are thought to be 16-wide Kernel direct memory-access (DMA) engines, with a shared 4 MB L2 cache. This apparently works with 5-dimensional tensors (a generalisation of vectors) containing three data types: 8-bit signed and unsigned integers, and 16-bit ‘half-word’ floating point values. These features are accessed only through Core ML, and perform neural network operations such as convolutions and matrix multiplication, but at low precision compared with numeric coprocessors. Geohot’s outline information provides further details.
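To give a feel for what ‘low precision’ means for those three data types, the sketch below shows the value ranges of 8-bit integers and what happens when numbers are rounded to 16-bit floating point. It assumes that the 16-bit type follows the IEEE 754 half-precision format, which Python's `struct` module supports with the `e` format code; the helper name `to_half` is hypothetical, and nothing here runs on or models the Neural Engine itself.

```python
import struct

# Value ranges of the two 8-bit integer types.
INT8_RANGE = (-2**7, 2**7 - 1)   # (-128, 127)
UINT8_RANGE = (0, 2**8 - 1)      # (0, 255)


def to_half(value: float) -> float:
    """Round a float to IEEE 754 half precision and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(INT8_RANGE, UINT8_RANGE)
print(to_half(0.1))      # 0.0999755859375, the closest half-precision value to 0.1
print(to_half(2049.0))   # 2048.0, as odd integers above 2048 can't be represented
```

With only around three significant decimal digits available, results like these are adequate for many neural network operations but not for general numeric work.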
Apple’s Matrix Coprocessor (AMX) is less understood. This appears aimed at improving performance on larger matrix operations than those which can be handled by Neon SIMD in the CPU, using floating-point types offering higher precision than those used by the Neural Engine. Dougall Johnson provides further details, gleaned from examining code which uses AMX instructions via the Accelerate libraries, which provide the interface for third-parties. Each M1 chip is believed to have just one AMX coprocessor shared between its eight CPU cores.
The Display Coprocessor (DCP) sits between the CPU and the Display Controller, and has been investigated by the team working on Asahi Linux.
Further coprocessors are thought to include:
- Always On Processor (AOP), which handles environmental sensors and activation of the system;
- Apple Video Decoder (AVD), which decodes video;
- Apple Video Encoder (AVE), which encodes video;
- Power Management Processor (PMP), the successor to power management features in the SMC;
- Apple Graphics (AGX), which may be a synonym for the GPU;
- Secure Enclave Processor (SEP), whose functions are described in the Apple Platform Security Guide.
Each runs its own real-time operating system, RTKit, and loads substantial firmware during the boot process.
This is a notional section of the M1 which may contain some of the ASCs, and performs I/O functions analogous to the Intel Southbridge. These are accessed via the Device Address Resolution Table (DART), which provides custom I/O memory management. Controllers are known to include:
- Display, which supports only one external display;
- Thunderbolt 4 and USB 4, which apparently don’t involve Intel designs, and currently only support two Thunderbolt ports;
- NVMe and Gen 4 PCI-Express, including the controller for the internal SSD, which includes hardware encryption;
- Wi-Fi, Ethernet and Bluetooth networking.