hoakley August 24, 2021 Macs, Technology

What’s in an M1 chip, and what does it do differently?

Over the last nine months, a great deal of work has gone into discovering just what is in Apple’s M1 chip, and what it all does. As Apple prepares to announce its successor in the next few weeks, I thought it might be worth surveying the work which has been accomplished so far, to establish the baseline for future Apple Silicon chips.

The M1 chip combines many of the features which have previously been implemented in multiple chips, from the CPU and GPU, which have commonly come from different designers, to all the specialist interfaces for peripherals such as storage controllers. The M1 doesn’t integrate main memory, though, which is contained within the package. It’s also unusual in incorporating many coprocessors in addition to the GPU, which provide state-of-the-art facilities for features such as deep learning using neural networks.

This contrasts with Intel’s Northbridge and Southbridge chipsets, in which Northbridge is the memory controller hub interfacing with the CPU, main memory and PCI-Express video cards, and Southbridge is the platform controller hub interfacing with slower and peripheral systems. In Sandy Bridge, Northbridge is integrated with the CPU in a SoC of more modest aspirations.

CPU

At its heart, each M1 chip has a total of eight processor cores, all based on Apple’s development of technology licensed from Arm. Four are described as Performance cores, dubbed Firestorm, and four are Efficiency cores, or Icestorm. These primarily differ in their compromise between performance and power consumption, with Firestorm cores performing in the same class as better Intel cores, and Icestorm delivering lower performance with much less power requirement and heat production. There are differences in the provision of cache memory, though:

Firestorm has 192+128 KB L1 cache per core, and shares a total of 12 MB L2 cache.
Icestorm has 128+64 KB L1 cache per core, and shares a total of 4 MB L2 cache.
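
To put those figures in context, the per-core and per-cluster totals can be tallied in a few lines. This is only a sketch of the arithmetic: the cache sizes are the ones listed above, while reading each "A+B" L1 figure as instruction plus data cache, and the dictionary layout and names, are assumptions made for illustration.

```python
# Cache figures copied from the list above. Treating the "A+B" L1 figures as
# instruction + data cache is an assumption of this sketch.
CLUSTERS = {
    "Firestorm": {"cores": 4, "l1i_kb": 192, "l1d_kb": 128, "l2_mb": 12},
    "Icestorm": {"cores": 4, "l1i_kb": 128, "l1d_kb": 64, "l2_mb": 4},
}


def l1_per_core_kb(cluster):
    """Combined L1 cache available to a single core, in KB."""
    return cluster["l1i_kb"] + cluster["l1d_kb"]


def l1_per_cluster_kb(cluster):
    """Combined L1 cache across every core of one type, in KB."""
    return cluster["cores"] * l1_per_core_kb(cluster)


total_l1_kb = sum(l1_per_cluster_kb(c) for c in CLUSTERS.values())
total_l2_mb = sum(c["l2_mb"] for c in CLUSTERS.values())

for name, c in CLUSTERS.items():
    print(
        f"{name}: {l1_per_core_kb(c)} KB L1 per core, "
        f"{l1_per_cluster_kb(c)} KB L1 across {c['cores']} cores, "
        f"{c['l2_mb']} MB shared L2"
    )
print(f"Whole chip: {total_l1_kb} KB L1, {total_l2_mb} MB L2")
```

On these numbers, each Firestorm core has 320 KB of L1 cache against 192 KB for each Icestorm core, and the four Firestorm cores share three times as much L2 cache.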

Both types of core support the ARMv8.4-A instruction set, with Neon (128-bit register) SIMD instructions for parallel processing of both integer and floating-point operations. However, they don’t support Arm’s Scalable Vector Extension (SVE) SIMD instructions. Each core features 8-wide decode to multiple integer execution units and four floating-point/SIMD units. Integer and floating-point operations, including SIMD, are fully supported in compiled high-level and assembly languages.

Ordinarily, apps don’t get to choose which type of core their processes run on. For background tasks, though, the code can choose a Quality of Service (QoS) which determines how macOS allocates them to cores. Most macOS background tasks and services, such as making Time Machine backups, are allocated exclusively to the Efficiency cores, leaving the Performance cores free for user tasks. Such asymmetric multiprocessing is characteristic of the M1, and seldom seen elsewhere.
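
As a rough illustration of how code can state its QoS, the sketch below asks macOS to treat the calling thread as background work, using the pthread call pthread_set_qos_class_self_np through Python's ctypes. The helper name request_background_qos is hypothetical, the constant is the value of QOS_CLASS_BACKGROUND in Apple's headers, and the function simply returns False on other platforms. It only expresses a request: which cores the thread then runs on remains macOS's decision.

```python
import ctypes
import sys

# Value of QOS_CLASS_BACKGROUND from Apple's <sys/qos.h>.
QOS_CLASS_BACKGROUND = 0x09


def request_background_qos():
    """Ask macOS to run the calling thread at background QoS.

    Returns True if the request was accepted, and False on failure or when
    not running on macOS. The function name is hypothetical.
    """
    if sys.platform != "darwin":
        return False
    try:
        # dlopen(NULL): look the symbol up among libraries already loaded.
        libc = ctypes.CDLL(None)
        set_qos = libc.pthread_set_qos_class_self_np
    except (OSError, AttributeError):
        return False
    set_qos.argtypes = [ctypes.c_uint, ctypes.c_int]
    set_qos.restype = ctypes.c_int
    # Second argument is the relative priority within the class (0 = none).
    return set_qos(QOS_CLASS_BACKGROUND, 0) == 0


if __name__ == "__main__":
    print("Background QoS requested:", request_background_qos())
```

Apps written in Swift or Objective-C would normally express the same thing through higher-level interfaces such as Dispatch queues rather than calling pthread functions directly.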

Documentation is extensive, including:
Andrei Frumusanu on AnandTech
ARMv8 Instruction Set Overview (Arm, PDF)
Arm A64 Instruction Set Architecture (Arm)
Arm Architecture Reference Manual (Arm)
Neon Programmer’s Guide (Arm)
Dougall Johnson’s detailed survey of the cores and their instructions.

GPU

This consists of eight cores (seven in cheaper models), each with 16 Execution Units, each of which has 8 Arithmetic Logic Units, making 1024 ALUs in total, which are capable of executing 24,576 threads simultaneously. Access to the GPU is provided to third parties through calls in the Metal and related frameworks.
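
The totals follow directly from multiplying out the hierarchy, as this short sketch shows. The constant and function names are made up for illustration; the figures are those quoted above, and the threads-per-ALU value is simply the quotient of two of them rather than a documented specification.

```python
# Figures quoted in the paragraph above.
EXECUTION_UNITS_PER_CORE = 16
ALUS_PER_EXECUTION_UNIT = 8
MAX_THREADS_8_CORE = 24_576


def total_alus(gpu_cores):
    """ALU count for a GPU with the given number of cores."""
    return gpu_cores * EXECUTION_UNITS_PER_CORE * ALUS_PER_EXECUTION_UNIT


alus_8_core = total_alus(8)
alus_7_core = total_alus(7)

# Derived, not documented: threads in flight per ALU on the 8-core GPU.
threads_per_alu = MAX_THREADS_8_CORE // alus_8_core

print(f"8-core GPU: {alus_8_core} ALUs")
print(f"7-core GPU: {alus_7_core} ALUs")
print(f"Threads in flight per ALU (8-core): {threads_per_alu}")
```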

Documentation now includes:
Dougall Johnson’s detailed account
Alyssa Rosenzweig’s series on the GPU.

Unified memory

Central to the M1 design model is the use of a common pool of memory, which Apple terms Unified memory, by CPU cores, GPUs, and some of the coprocessors. In the M1, the memory is LPDDR4X SDRAM in either 8 or 16 GB configuration, seen to the right of the M1 in the image above, and cannot be expanded. Unified memory eliminates the need to transfer data between main (CPU) memory and that provided for coprocessors, such as video memory in a graphics card. This is a reversal of the previous pattern, in which main memory was only used by low-end GPUs, but follows a recent trend relying on fast-access shared memory and tight integration of the GPU.

Coprocessors (ASCs)

Although Apple has revealed some of the specialist coprocessors in the M1, most are undocumented, and can’t be directly accessed by engineers other than Apple’s.

Best-known is Apple’s Neural Engine (ANE), described as containing 16 ‘cores’ which are thought to be 16-wide Kernel direct memory-access (DMA) engines, with a shared 4 MB L2 cache. This apparently works with 5-dimensional tensors (a generalisation of vectors) containing three data types: 8-bit signed and unsigned integers, and 16-bit ‘half-word’ floating point values. These features are accessed only through Core ML, and perform neural network operations such as convolutions and matrix multiplication, but at low precision compared with numeric coprocessors. Geohot’s outline information provides further details.
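
To give a feel for what low precision means here, the sketch below rounds ordinary Python floats to IEEE 754 half precision (16 bits) using the standard struct module, and lists the ranges of the two 8-bit integer types. It runs on the CPU and says nothing about how the ANE itself handles these types; the helper name to_half is hypothetical.

```python
import struct


def to_half(value):
    """Round a Python float to IEEE 754 half precision and back again."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Ranges of the two 8-bit integer types.
INT8_RANGE = (-(2 ** 7), 2 ** 7 - 1)
UINT8_RANGE = (0, 2 ** 8 - 1)

print("int8 range: ", INT8_RANGE)
print("uint8 range:", UINT8_RANGE)

# Half precision keeps 11 significant bits, so most decimal fractions and
# integers above 2048 can only be stored approximately.
for value in (0.5, 0.1, 2049.0):
    print(f"{value!r} stored as half precision becomes {to_half(value)!r}")
```

Values such as 0.5 survive unchanged, but 0.1 and 2049 do not, which is why this format suits neural network inference better than general numeric work.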

Apple’s Matrix Coprocessor (AMX) is less understood. This appears aimed at improving performance on larger matrix operations than those which can be handled by Neon SIMD in the CPU, using floating-point types offering higher precision than those used by the Neural Engine. Dougall Johnson provides further details, gleaned from examining code which uses AMX instructions via the Accelerate libraries, which provide the interface for third parties. Each M1 chip is believed to have just one AMX coprocessor shared between its eight CPU cores.

The Display Coprocessor (DCP) sits between the CPU and the Display Controller, and has been investigated by the team working on Asahi Linux.

Further coprocessors are thought to include:

Always On Processor (AOP), which handles environmental sensors and activation of the system;
Apple Video Decoder (AVD), which decodes video;
Apple Video Encoder (AVE), which encodes video;
Power Management Processor (PMP), the successor to power management features in the SMC;
Apple Graphics (AGX), which may be a synonym for the GPU;
Secure Enclave Processor (SEP), whose functions are described in the Apple Platform Security Guide.

Each runs its own real-time operating system, RTKit, and loads substantial firmware during the boot process.

Fabric

This is a notional section of the M1 which may contain some of the ASCs, and performs I/O functions analogous to the Intel Southbridge. These are accessed via the Device Address Resolution Table (DART), which provides custom I/O memory management. Controllers are known to include:

Display, which supports only one external display;
Thunderbolt 4 and USB 4, which apparently don’t involve Intel designs, and currently only support two Thunderbolt ports;
NVMe and Gen 4 PCI-Express, including the controller for the internal SSD, which includes hardware encryption;
Wi-Fi, Ethernet and Bluetooth networking.

23 Comments


1

Oliver Busch on August 24, 2021 at 6:58 am

Awesome article. So many links for the must-read-list.
What I find really interesting as well: I read somewhere that some key function(s) used by the Rosetta 2 translation layer are implemented in the hardware, contributing to the outstanding performance.

- 2
  
  hoakley on August 24, 2021 at 8:47 am
  
  Thank you.
  I think you may be referring to the processor’s ability to change endianness, which has long been a feature of Arm processors but does come in handy for such purposes. I’m not aware of any instruction being added for this specific purpose, though.
  Howard.
  
  - 3
    
    Andrew Reilly on September 2, 2021 at 5:52 am
    
    I believe that the big hardware-assist for Rosetta is an ability to switch to a memory consistency model that is compatible with Intel processors. Arm’s architecture prescribes a more relaxed model, which means that accurate emulation/translation of intel code requires the use of many memory barrier instructions, which slow most things down. Picking up that trick in hardware is a very cool idea, if modelling intel code is something your processor might be asked to do…
    
    - 4
      
      hoakley on September 2, 2021 at 9:10 pm
      
      Thank you – I did eventually work that out. I’m not sure that anyone has discovered how it’s done though, have they? It certainly isn’t through extending the instruction set, which is one thing Apple apparently can’t do.
      Howard.
      
- 5
  
  myfanwy123 on August 24, 2021 at 3:59 pm
  
  I suspect Oliver is referring to the M1’s unique (for ARM) Intel-like memory ordering mode. This allows Rosetta 2 to run multithreaded Intel code much more efficiently than would otherwise be possible with ARM’s “native” memory order.
  
  - 6
    
    hoakley on August 24, 2021 at 5:12 pm
    
    Thank you. I think you’re here referring to what is generally known as endianness – the order of bytes as they’re stored in multi-byte data in memory. If so, this is an urban myth.
    Intel processors are little-endian, but Motorola 68K processors are big-endian. So switching between them requires some means of re-ordering bytes.
    Arm processors were originally little-endian, but for quite a few years have been bi-endian, which means they can be run in either mode, and switched between them. But there’s no need for that: Apple has always run its Arm processors in iPhones etc. in little-endian mode, and that’s the default mode in the M1 too. As Apple succinctly puts it:
    “Both Apple silicon and Intel-based Mac computers use the little-endian format for data, so you don’t need to make endian conversions in your code.”
    So little-endian isn’t unique for Arm, it was the original mode for Arm processors, and since they became bi-endian, both modes have been available on all Arm processors, not just Apple’s.
    Or have I misunderstood, and you’re referring to something different?
    Howard.
    
  - 7
    
    hoakley on August 24, 2021 at 7:35 pm
    
    Ah – I think I’ve found the reference, it’s to Intel-style TSO memory ordering, versus the weaker ordering normally used on Arm processors. If you look at Yining Karl Li’s accounts, Apple’s Arm designs aren’t the only Arm processors to have custom memory ordering options. As this is fairly deep internally, I don’t think it’s understood exactly what Apple has done in this implementation, and the evidence is that it’s only used for Rosetta 2, and not useful for running native code. So, yes, it’s important during the transition, it seems, but outside of Rosetta 2 has no value.
    It’s also worth reminding ourselves that the cores in the M1 aren’t Arm processors, they’re Apple’s through and through. Arm licences some of the technology, Apple designs the processors, and they don’t conform to any of those available from Arm itself. For all we know, some of the code for instance to obtain square roots could be very different from that used by Arm.
    Howard.
    
8

Michele Galvagno on August 24, 2021 at 7:21 am

This is an incredible account! Thank you Howard

Michele Galvagno

- 9
  
  hoakley on August 24, 2021 at 8:47 am
  
  Thank you, Michele.
  Howard.
  
10

Colstan on August 24, 2021 at 9:49 am

Apple has gotten a lot of flak recently for their privacy and political stances, deservedly so. I’ve gotten so fed up with it that I’ve been half-heartedly looking at potentially building a PC and investigating Linux distros, just in case I have to leave the platform, which I would prefer not to do.

It’s articles like this that remind me why I use a Mac in the first place. Apple’s engineering departments are the best in the industry and the M1 is an excellent example. It’s ultimately the best in class integration between the hardware and software that makes the Mac special, and in my opinion, better than anything on the PC side. That doesn’t make me any less upset about the direction that Apple’s leadership is taking, but it’s a good reminder of what Apple is capable of when it concentrates on technology over social engineering.

Anyway, another excellent article, I always learn a great deal.

- 11
  
  hoakley on August 24, 2021 at 11:37 am
  
  Thank you.
  Howard.
  
12

jlforrest on August 25, 2021 at 3:07 am

A minor nit concerning the sentence “The M1 doesn’t integrate main memory, though, which is contained within the package”. To me, everything that is contained within the package is also integrated in the package.

- 13
  
  hoakley on August 25, 2021 at 7:11 am
  
  Thank you. That paragraph explains what is integrated into the chip (the opening words), and makes the clear distinction that the main memory isn’t integrated into the chip but the package.
  Howard.
  
14

Pico on August 25, 2021 at 6:53 am

One thing Apple has done in pretty much all of their marketing for these first Apple Silicon Macs is to say things like “Two Thunderbolt / USB 4 ports”.

This surely makes it sound like they mean “Thunderbolt 4” and “USB 4” (as you wrote). But, that’s not actually the case. It’s actually Thunderbolt 3 / USB 4. Luckily, these actual specifics are correctly noted in each product’s Tech Specs (https://www.apple.com/macbook-pro-13/specs/).

I think they have combined these ports like this because Thunderbolt 3 and USB 4 have basically the same specifications, but it makes things very confusing and easy to misunderstand.

- 15
  
  hoakley on August 25, 2021 at 7:06 am
  
  Thank you, Pico. I think the problem is even deeper: as I understand it, the key feature of Thunderbolt 4 is support for USB4, which isn’t a feature of Thunderbolt 3. So Thunderbolt 4 itself is but a little more than Thunderbolt 3 + USB4. I’m not sure whether these new ports also support those other additional features of TB4?
  Howard.
  
- 16
  
  Pico on August 25, 2021 at 7:07 am
  
  USB specs and especially naming has gotten confusing as heck, and it seems like Thunderbolt specs are becoming equally confusing when it comes to Thunderbolt 3 vs Thunderbolt 4.
  
  This article actually breaks it all down pretty well though: https://www.pcmag.com/how-to/what-is-thunderbolt-4-why-this-new-interface-will-matter-in-pcs-in-2021
  
  In there they happen to mention the most likely real reason that Apple is not using Thunderbolt 4. And Apple may have just marketed it in this sneaky way so folks wouldn’t feel like they’re getting last year’s port:
  
  “Intel opened up the Thunderbolt 3 protocol to USB’s controlling consortium (the USB-IF) for royalty-free use in the development of next-generation USB4, delivering faster speeds and interoperability to USB4 devices.”
  
  - 17
    
    hoakley on August 25, 2021 at 7:57 am
    
    Perhaps we should refer to the M1 implementation as TB3.5 – as it certainly exceeds TB3, and now supports TB4 hubs with 4 ports, as well as USB4. It might also support multiple 4K displays, because the hardware limit there appears to be imposed by the display controller in the M1, rather than TB.
    What a mess – just when you thought that TB was going to be simpler and clearer than USB.
    Howard.
    
    - 18
      
      Pico on August 25, 2021 at 3:07 pm
      
      Yeah, USB and Thunderbolt specs and naming are quite a mess.
      
      It could be true that Apple’s TB3 implementation in M1 may very nearly match TB4 since TB4 mainly raises the minimum required specs rather than raising any maximums. I think Apple has often surpassed minimum specs in their USB implementations in the past as well. But, I doubt Apple would dare release such technical and detailed data to be able to know exactly how close these TB3 specs are to the TB4 minimums.
      
      Other than the fact that the physical USB-C port itself is wonderful, and the latest USB and TB speeds are fantastic, I’ve grown to loathe USB. The naming has become absolutely absurd and indecipherable and they’ve even retroactively renamed old versions to make them even more confusing!
      
      The dream of a single port and single cable sounded like a perfect world until it turned into a single port and cable that always looks the same but either or both could have vastly different specs and often no way to clearly know what specs will be used when you plug this into that with any old cable. It’s become somewhat worse in quite a few cases than just having different ports and cables for different speeds and protocols.
      
    - 19
      
      Pico on August 25, 2021 at 3:39 pm
      
      USB 3 = USB 3.1 Gen 1 = USB 3.2 Gen 1 = 5 Gbps
      
      USB 3.1 Gen 2 = USB 3.2 Gen 2×1 (aka just USB 3.2 Gen 2) = 10 Gbps
      
      USB 3.2 Gen 2×2 = USB4 (20 Gbps version) = 20 Gbps
      
      USB4 (40 Gbps version) = 40 Gbps
      
      And that’s not even including any Thunderbolt overlaps.
      
      I really think the USB-IF is trolling the world with this naming. It’s just not logical marketing naming. It’s technical implementation jargon that should have been given distinct and obvious (and consistent) names for each speed tier. I never thought that FireWire 400 and FireWire 800 would seem like elegant naming in retrospect.
      
    - 20
      
      hoakley on August 25, 2021 at 7:07 pm
      
      Thank you.
      If any of this had been brought before a proper standard-making body, it would have been thrown out. This is what happens when marketing departments try to make standards.
      Howard.
      
    - 21
      
      hoakley on August 25, 2021 at 7:05 pm
      
      Thank you. I agree completely, I regret to have to say.
      Howard.
      
22

name99 on August 25, 2021 at 5:24 pm

There’s MUCH more to say about this, Howard, but one important aspect of the SoC that you have omitted is the functional DMA.

Essentially as data transits across the chip from a source (eg DRAM) to an endpoint (eg WiFi) it can be modified in a variety of ways. This provides much of the functionality of smartNICs or (in their new terminology) DPUs/IPUs. So the data can be compressed, encrypted, have a CRC calculated, and have packet headers appended or removed.
An additional tweak to make this work well is that the DMA can be routed directly to a cache (as opposed to DRAM). Anyone who has worked on zero copy networking will know why this is a big deal.

This is part of a larger trend which is that Apple really takes the SYSTEM part of SoC seriously. (ARM and Qualcomm may as well; it may be essential for phone level power management; but it’s not an obvious part of the x86/Windows/Linux world).
This means things like all transactions on the NoC have src and type tags and are part of flows, so that flows can be prioritized. (eg ISP/camera stuff are above GPU stuff which is above CPU stuff); the sorts of QoS and bandwidth partitioning that Intel/AMD boast about for their high end Xeon chips have been part of Apple since probably at least the A6, maybe even part of the basic A4 and A5 designs.
Similarly for cache management which is considered system-wide functionality, eg unused caches in some IP blocks can be used by other IP blocks, or pages can be marked as “streaming” so that data from them will automatically be treated as non-temporal by every cache they pass through, CPU or otherwise (as opposed to requiring special instructions). Again as opposed to the X86 which evolved caches from individual non-co-operating cores, and has made a big deal in recent years over (very limited) abilities to have L3 partitioned and controlled at the system level (and again, only in expensive Xeons).

- 23
  
  hoakley on August 25, 2021 at 7:08 pm
  
  Thank you. Oh there’s much, much more besides. It’s just that Apple is so reticent to document anything.
  Howard.
  