In less than a day, MacSysAdmin 2022 goes live. To supplement my video presentation there, I’m starting a series here looking in more detail at how to get the best performance from Apple silicon chips.
Until Apple released its first M1 Macs two years ago, conventional wisdom was that you made computers faster by giving them more processor cores, and by hiving off some functions, notably graphics, to semi-independent systems such as GPUs. Apple already had experience of a radically different approach, which it had been using in its mobile devices, where GPUs and other co-processors are tightly integrated, and there are two types of CPU core.
In 2011, ARM introduced big.LITTLE, a new architecture designed to combine high-performance and low-power cores. Initially aimed primarily at devices running on battery power, it was first used for what’s termed clustered switching, which switches all active processes between one cluster of Efficiency (E) cores and another cluster of Performance (P) cores, according to demand. This appeared in Qualcomm Snapdragon 615 and 808 chips from late 2014, and in Apple’s A10 Fusion from September 2016.
big.LITTLE also offered an alternative mode, global task scheduling or heterogeneous multi-processing (HMP), in which a mixture of P and E cores is available at all times, with threads allocated to the different core types by a kernel scheduler. Apple introduced its implementation in the A11 Bionic in September 2017, its first chip able to use all six of its cores simultaneously.
The advantage of big.LITTLE in battery-powered mobile devices quickly became obvious. Whenever possible, tasks can be scheduled on the E cores, minimising power consumption. When required, the power of the P cores is available almost instantly, delivering the performance the user expects, at the cost of reduced battery endurance.
In normal use, a mobile device using a combination of E and P cores in HMP will therefore try to run as many of its threads as possible on its E cores, only bringing in the P cores when necessary. This strategy is reflected in the design of recent Apple silicon chips for mobile devices, which normally have 4 E and 2 P cores, as in A13 to A16 Bionic chips.
Even for Mac notebook computers, Apple needed to adopt a different approach. While battery endurance is important, it would have been disastrous if Apple silicon models like the MacBook Air had performed worse than their Intel predecessors. Thus the first tranche of Apple silicon Macs, with the basic M1 chip, has equal numbers of E and P cores, in two clusters of four, an arrangement anticipated in ARM’s big.LITTLE architecture. Threads are then allocated to the two core types by a kernel scheduler, not by hardware.
This kernel scheduler is designed to confine background services to the E cores as much as possible, so that user processes can draw on most of the performance available from the P cores. It’s easy to watch this at work on an Apple silicon Mac in the CPU History window of Activity Monitor.
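Apps take part in this through Quality of Service (QoS) classes. As a rough illustration, here’s a minimal Swift sketch: on Apple silicon, work dispatched at background QoS is normally confined to the E cores, while userInitiated and higher classes make threads eligible to run on the P cores. The messages printed are my assumptions about typical placement; the scheduler, not the app, makes the final decision.

```swift
import Foundation

// Background QoS: deferrable housekeeping, normally kept to the E cores.
DispatchQueue.global(qos: .background).async {
    print("background work, expected on the E cores")
}

// User-initiated QoS: work the user is waiting on, eligible for the P cores.
DispatchQueue.global(qos: .userInitiated).async {
    print("user-initiated work, expected on the P cores")
}

// Keep the process alive long enough for the async blocks to run.
Thread.sleep(forTimeInterval: 1)
```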
This example is taken from an M1 Pro chip, which has a different complement of cores from the basic M1 and M2 chips: only two E cores, but a total of eight P cores. Using its P cores alone, the maximum CPU performance of an M1 Pro is thus twice that of the basic M1. Although it has only half the number of E cores, macOS manages those differently from the four in the basic M1, chiefly by running them at higher frequencies, to ensure that an M1 Pro is no slower than the basic M1 when running background tasks.
These ten cores are arranged in three clusters: cores 1 and 2 form a two-core E cluster, shown at the top of the window; cores 3-6 form the first four-core P cluster, in the middle; cores 7-10 form the second four-core P cluster, at the bottom. For each core shown, the most recent activity is at the right, and here it has just fallen close to idle across all ten cores.
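You don’t need Activity Monitor to discover this layout. The short Swift sketch below asks macOS for the core counts; it assumes the hw.perflevel sysctl keys that Apple silicon Macs have reported since macOS 12 Monterey, where perflevel0 covers the P cores and perflevel1 the E cores.

```swift
import Darwin

// Read an integer sysctl value by name; returns nil if the key is absent.
func sysctlInt(_ name: String) -> Int? {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return Int(value)
}

// perflevel0 is the higher-performance level (P cores),
// perflevel1 the efficiency level (E cores).
let pCores = sysctlInt("hw.perflevel0.logicalcpu") ?? 0
let eCores = sysctlInt("hw.perflevel1.logicalcpu") ?? 0
print("P cores: \(pCores), E cores: \(eCores)")
// On an M1 Pro this should print: P cores: 8, E cores: 2
```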
This window shows a sequence of different threads typical of everyday use of an M1 Pro. This period starts (at the left) with a Time Machine backup putting a heavy load on the two E cores, followed by waves of activity cleaning up old backups. For much of the period, the two clusters of P cores are inactive, running at their idle frequency of 600 MHz and using precious little power. Light activity is largely confined to the first P cluster, allowing the second cluster to idle most of the time.
Towards the right is a short period of heavy workload on five of the P cores, when they ran four threads of compression for a user app at high priority. Because this ran on otherwise idle P cores, the background service running on the E cores didn’t affect compression performance at all. Despite running this compression task as quickly as a basic M1 would on its four P cores, this M1 Pro still had half the potential of its P cores free, should the user want to run additional processes.
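A user app might dispatch work of that kind along the lines of this Swift sketch; the four workers and the arithmetic standing in for compression are my assumptions, not the actual app observed above. Each worker runs at userInitiated QoS, so the scheduler can spread them across the P cores while the E cores carry on with background services.

```swift
import Foundation

// Run CPU-heavy work, such as compression, as four high-priority threads.
// The worker count and the arithmetic loop are illustrative stand-ins.
let workers = 4
let group = DispatchGroup()
let queue = DispatchQueue(label: "compress",
                          qos: .userInitiated,
                          attributes: .concurrent)

for index in 0..<workers {
    queue.async(group: group) {
        // Stand-in for a compression thread: pure CPU work.
        var checksum: UInt64 = 0
        for i in 0..<50_000_000 {
            checksum = checksum &+ UInt64(i ^ index)
        }
        print("worker \(index) finished (\(checksum))")
    }
}

// Block until all four workers complete.
group.wait()
```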
Earlier this year, Intel started shipping its first chips to use a hybrid design along the lines of ARM’s big.LITTLE, in its Alder Lake generation. These range from Celeron chips with a single P core and four E cores, up to the Core i9 with eight P and eight E cores. One major difference is that, instead of relying on kernel scheduling, these chips primarily rely on Intel Thread Director (ITD), developed using machine learning and only assisted by input from the operating system. In practice this limits control by code or the user, and isn’t as flexible as the kernel scheduling originally envisaged in the big.LITTLE architecture and implemented in Apple silicon chips.
Once my video for MacSysAdmin 2022 is available, I’ll post a follow-up article linking to it.