The hunt for the M1’s neural engine

After yesterday’s explanation of the Apple Neural Engine (ANE) in M1 series Macs, it’s time to go in search of this elusive part of Apple Silicon chips, and discover how it’s used.

So far, all I have seen of the ANE are some entries in the log, and claims about how much it can accelerate Machine Learning (ML). The ANE first makes its appearance early during the kernel phase of the boot process, initially through its Load Balancer, which is initialised with:
virtual bool H1xANELoadBalancer::init(OSDictionary *) WE are here: H1xANELoadBalancer: 30
H1xANELoadBalancer::probe
virtual IOService *H1xANELoadBalancer::probe(IOService *, SInt32 *) WE are here: H1xANELoadBalancer: 51, res: <private>, score: 0
H1xANELoadBalancer::start, provider: <private>
H1xANELoadBalancer::start, res: 1
WE are here: H1xANELoadBalancer
virtual bool H1xANELoadBalancer::start(IOService *) :H11ANEIn::start - ANE LB is enabled by default!

That’s followed by the ANE interface:
static AppleH1xANEMsgQue *AppleH1xANEMsgQue::WithLoadBalancerInterface(H1xANELoadBalancer *)
static AppleH1xANEMsgQue *AppleH1xANEMsgQue::WithLoadBalancerInterface(H1xANELoadBalancer *) line 37
IOReturn AppleH1xANEMsgQue::startMsgQ(H1xANELoadBalancer *)
static AppleH1xANEMsgQue *AppleH1xANEMsgQue::WithLoadBalancerInterface(H1xANELoadBalancer *) Done

then the Load Balancer again:
H1xEventSource::initWithOptions
H1xEventSource::enable
Done H1xANELoadBalancer, start H1xANEMessages workloop

followed later by other messages recording initialisation of the ANE interface.

After that, the ANE seems to get on with its work quietly. During Visual Look Up, mediaanalysisd is a bit more forthcoming, posting short sequences at the start of each run of the ANE, such as:
Found matching service: H1xANELoadBalancer
Found ANE H1xANELoadBalancer :1
Found matching service: H11ANEIn
Found ANE device :2
Total num of devices 2
LB Found at index 0
(Single-ANE System with LB) Opening H11ANE device
H11ANEDevice::H11ANEDeviceOpen, usage type: 1
H11ANE Device Open succeeded with usage type: 1
(Single-ANE System with LB) Selected H11ANE device

Those are followed by a series of messages written by com.apple.Espresso, which manages the ANE during Visual Look Up.
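To pick out which ANE components appear in a log excerpt like those above, a short script can help. This is a minimal sketch, assuming the log has been saved as plain text and that the components follow the naming seen in these excerpts (H1xANE…, H11ANE…, AppleH1xANE…); the pattern and the function name ane_components are my own, not anything Apple documents.

```python
import re
from collections import Counter

# Assumed naming pattern, based only on the identifiers seen in the excerpts:
# H1xANELoadBalancer, H11ANEIn, H11ANEDevice, AppleH1xANEMsgQue and so on.
ANE_NAME = re.compile(r"\b(?:Apple)?H1[x1]ANE\w*")

def ane_components(lines):
    """Count how often each ANE-related identifier appears in the log lines."""
    counts = Counter()
    for line in lines:
        counts.update(ANE_NAME.findall(line))
    return counts

# Four of the mediaanalysisd entries quoted above.
excerpt = [
    "Found matching service: H1xANELoadBalancer",
    "Found ANE H1xANELoadBalancer :1",
    "Found matching service: H11ANEIn",
    "Found ANE device :2",
]

for name, count in ane_components(excerpt).most_common():
    print(f"{name}: {count}")
```

Entries that mention the ANE only in passing, such as "Found ANE device :2", aren't counted, as they don't name a component.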

I next looked for an app which might load the ANE. Geekbench ML looked promising, but so far seems to be available only for iOS and Android, so I built my own using one of Apple’s ML demo projects. That ran consistently faster on M1 series Macs than on this Intel iMac Pro. For instance, the median time to complete one learn and test cycle was 2.26 seconds on Intel, and 1.21 seconds on M1 Macs. That’s similar in magnitude to the acceleration I have seen during neural network sections of Visual Look Up.
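To put a number on that acceleration, the two medians can be turned into a speed-up ratio. This is a minimal sketch: the function name speedup is my own, and as I only quote the medians here, each list holds a single value rather than the full set of timings.

```python
from statistics import median

def speedup(baseline_times, candidate_times):
    """Ratio of the median baseline time to the median candidate time."""
    return median(baseline_times) / median(candidate_times)

# Median times in seconds for one learn and test cycle, as given above.
intel_times = [2.26]
m1_times = [1.21]

ratio = speedup(intel_times, m1_times)
print(f"M1 completes a cycle about {ratio:.2f} times as fast")
```

That works out at a little under twice as fast. With the full timings for each Mac, the same function would take the medians itself.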

The acid test, though, was inspecting the responses of the ANE using powermetrics. That’s something of a monster, which can dump fine detail about each core and a great deal more, including a little about the ANE. Once my test ML app was running, it soon took 100% active residency on the first of the P cores at a frequency of 3036-3228 MHz. However, that app didn’t make the ANE or GPU blink: ANE power remained steadfastly at 0 mW, with GPU power at only 5 mW and idling at 4 MHz. That’s in spite of the app making plenty of calls to use BNNS, and running for several seconds.
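As powermetrics writes so much, it can be easier to save its output to a text file and pull out just the power figures. This is a minimal sketch, assuming those appear on lines of the form "ANE Power: 0 mW" and "GPU Power: 5 mW"; the exact layout may vary between macOS versions, so the pattern may need adjusting. The sample text is made up from the figures above, not copied from real output.

```python
import re

# Assumed line format: "ANE Power: 0 mW" or "GPU Power: 5 mW" at the start of a line.
POWER_LINE = re.compile(r"^(ANE|GPU) Power:\s*(\d+)\s*mW", re.MULTILINE)

def power_readings(text):
    """Collect ANE and GPU power readings in mW from saved powermetrics output."""
    readings = {"ANE": [], "GPU": []}
    for unit, value in POWER_LINE.findall(text):
        readings[unit].append(int(value))
    return readings

# Illustrative sample using the figures seen while the test app was running.
sample = """\
ANE Power: 0 mW
GPU Power: 5 mW
"""

readings = power_readings(sample)
# The ANE only counts as idle if there was at least one reading and all were zero.
ane_idle = bool(readings["ANE"]) and all(mw == 0 for mw in readings["ANE"])
print("ANE idle throughout" if ane_idle else "ANE was active")
```

powermetrics has to be run with root privileges, and its -i and -n options set the sampling interval in milliseconds and the number of samples.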

It was therefore time to return to the one feature which I know reliably invokes the ANE repeatedly, Visual Look Up (VLU). Running a series of powermetrics samples during successful VLU finally showed the ANE at work.

In the first one-second sampling period, the ANE drew 30 mW power, dropping slightly to 22 mW in the second period, then peaking at 49 mW in the third. ANE read reached a maximum of 232 MB/s and write 167 MB/s. GPU power and frequency remained low, suggesting that the ML code in VLU is normally handled by the ANE. Those figures correspond to the first phase of VLU involving analysis and parsing. The second phase, visual search, was accompanied by shorter and lower ANE activity.
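Those three samples can be summarised with a little arithmetic. This is a rough sketch, assuming each power figure is an average over its one-second sampling period, so that milliwatts multiplied by seconds gives millijoules; if powermetrics reports something other than a period average, the energy figure won't hold.

```python
# ANE power in mW for the three one-second sampling periods described above.
ane_power_mw = [30, 22, 49]
period_s = 1.0  # assumed length of each sampling period, in seconds

peak_mw = max(ane_power_mw)
mean_mw = sum(ane_power_mw) / len(ane_power_mw)
# Energy in mJ is power in mW multiplied by time in seconds, summed over periods.
energy_mj = sum(mw * period_s for mw in ane_power_mw)

print(f"peak {peak_mw} mW, mean {mean_mw:.1f} mW, energy about {energy_mj:.0f} mJ")
```

On that assumption, the ANE used roughly 101 mJ over the three seconds, at a mean of just under 34 mW.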

Conclusions

  • Visual Look Up on M1 series Macs consistently uses the Apple Neural Engine.
  • VLU’s call chain is through mediaanalysisd, which uses Espresso to manage the ANE. Espresso is also used on Intel Macs to manage neural networks run on their CPU cores.
  • Maximum power drawn by the ANE was 49 mW, which is low even in comparison to that required by E cores.
  • The ANE reached high peak (memory) read and write rates of 232 and 167 MB/s during use.
  • BNNS functions may be run entirely on CPU cores, even when they appear suitable for the ANE, and the ANE is available.
  • powermetrics is the tool of choice for observing ANE activity; indeed, it appears to be the only tool.