The hunt for the M1’s neural engine

After yesterday’s explanation of the Apple Neural Engine (ANE) in M1 series Macs, it’s time to go in search of this elusive part of Apple Silicon chips, and discover how it’s used.

So far, all I have seen of the ANE are some entries in the log, and claims for how much it can accelerate Machine Learning (ML). The ANE first makes its appearance early during the kernel phase of the boot process, initially through its Load Balancer, which is initialised with
virtual bool H1xANELoadBalancer::init(OSDictionary *) WE are here: H1xANELoadBalancer: 30 H1xANELoadBalancer::probe virtual IOService *H1xANELoadBalancer::probe(IOService *, SInt32 *) WE are here: H1xANELoadBalancer: 51, res: <private>, score: 0 H1xANELoadBalancer::start, provider: <private> H1xANELoadBalancer::start, res: 1 WE are here: H1xANELoadBalancer virtual bool H1xANELoadBalancer::start(IOService *) :H11ANEIn::start - ANE LB is enabled by default!

That’s followed by the ANE interface:
static AppleH1xANEMsgQue *AppleH1xANEMsgQue::WithLoadBalancerInterface(H1xANELoadBalancer *) static AppleH1xANEMsgQue *AppleH1xANEMsgQue::WithLoadBalancerInterface(H1xANELoadBalancer *) line 37 IOReturn AppleH1xANEMsgQue::startMsgQ(H1xANELoadBalancer *) static AppleH1xANEMsgQue *AppleH1xANEMsgQue::WithLoadBalancerInterface(H1xANELoadBalancer *) Done
then the Load Balancer again:
H1xEventSource::initWithOptions H1xEventSource::enable Done H1xANELoadBalancer, start H1xANEMessages workloop
followed later by other messages recording initialisation of the ANE interface.

After that, the ANE seems to get on with its work quietly. During Visual Look Up, mediaanalysisd is a bit more forthcoming, posting short sequences at the start of each run of the ANE, such as:
Found matching service: H1xANELoadBalancer Found ANE H1xANELoadBalancer :1 Found matching service: H11ANEIn Found ANE device :2 Total num of devices 2 LB Found at index 0 (Single-ANE System with LB) Opening H11ANE device H11ANEDevice::H11ANEDeviceOpen, usage type: 1 H11ANE Device Open succeeded with usage type: 1 (Single-ANE System with LB) Selected H11ANE device

Those are followed by a series of messages written by com.apple.Espresso, which manages the ANE during Visual Look Up.

I next looked for an app which might load the ANE. Geekbench ML looked promising, but so far only seems to be available for iOS and Android, so I built my own using one of Apple’s ML demo projects. That ran consistently faster on M1 series Macs than on this Intel iMac Pro. For instance, the median time to complete one learn and test cycle was 2.26 seconds on Intel, and 1.21 seconds on M1 Macs. That’s a similar magnitude of acceleration to that I have seen during neural network sections of Visual Look Up.
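To put a number on that difference, the two medians quoted above can be turned into a speed-up factor. This is only arithmetic on the reported figures, and the variable names are my own:

```python
# Median times for one learn-and-test cycle, as reported in the text (seconds).
intel_median_s = 2.26
m1_median_s = 1.21

# Speed-up factor: how many times faster the M1 run completed.
speedup = intel_median_s / m1_median_s

# Time saved, as a fraction of the Intel run.
time_saved = 1 - m1_median_s / intel_median_s

print(f"Speed-up: {speedup:.2f}x, time saved: {time_saved:.0%}")
# prints "Speed-up: 1.87x, time saved: 46%"
```

So the M1 completes each cycle a little under twice as fast, which on its own says nothing about whether the ANE, GPU or CPU cores did the work.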

The acid test, though, was inspecting the responses of the ANE using powermetrics. That’s something of a monster, which can dump fine detail about each core and a great deal more, including a little about the ANE. Once my test ML app was running, it was soon taking 100% active residency on the first of the P cores at a frequency of 3036-3228 MHz. However, that app didn’t make the ANE or GPU blink: ANE power remained steadfastly at 0 mW, with the GPU drawing only 5 mW and idling at 4 MHz. That’s in spite of the app making plenty of calls to BNNS, and running for several seconds.
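Picking the ANE and GPU figures out of a long powermetrics dump by eye is tedious, so here is a minimal sketch of how that might be automated. It assumes that power is reported on lines of the form `ANE Power: 0 mW`; check that against the output on your own Mac, as the format may differ between macOS versions. The function name `parse_power` is hypothetical, and the sample text is illustrative only, built from the two figures quoted above rather than copied from a real dump.

```python
import re

# Assumed line format, e.g. "ANE Power: 0 mW"; adjust if your output differs.
POWER_LINE = re.compile(r"^(CPU|GPU|ANE) Power:\s*(\d+)\s*mW", re.MULTILINE)

def parse_power(text):
    """Return a list of {unit: milliwatts} dicts, one per sample found in text."""
    samples = []
    current = {}
    for unit, mw in POWER_LINE.findall(text):
        if unit in current:  # a repeated unit means a new sample has started
            samples.append(current)
            current = {}
        current[unit] = int(mw)
    if current:
        samples.append(current)
    return samples

# Illustrative text only: the GPU and ANE figures are those quoted above.
sample_output = """\
GPU Power: 5 mW
ANE Power: 0 mW
"""

samples = parse_power(sample_output)
# The ANE counts as idle if it drew no power in any sample.
ane_idle = all(s.get("ANE", 0) == 0 for s in samples)
print(samples, "ANE idle:", ane_idle)
```

In practice the text would come from a saved run, for example the output of something like `sudo powermetrics -i 1000 -n 3` redirected to a file, rather than from an inline string.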

It was therefore time to return to the one feature which I know reliably invokes the ANE repeatedly, Visual Look Up (VLU). Running a series of powermetrics samples during successful VLU finally demonstrated the salience of the ANE.

In the first one-second sampling period, the ANE drew 30 mW, dropping slightly to 22 mW in the second period, then peaking at 49 mW in the third. ANE read rate reached a maximum of 232 MB/s and write rate 167 MB/s. GPU power and frequency remained low, suggesting that the ML code in VLU is normally handled by the ANE. Those figures correspond to the first phase of VLU involving analysis and parsing. The second, of visual search, was accompanied by shorter and lower ANE activity.
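For a rough sense of scale, those three samples can be summarised as follows. This is a sketch that assumes each reported figure is the average power over its one-second period; the variable names are my own.

```python
# ANE power in three consecutive one-second samples, as reported above (mW).
ane_power_mw = [30, 22, 49]
sample_period_s = 1.0  # each sampling period lasted one second

peak_mw = max(ane_power_mw)
mean_mw = sum(ane_power_mw) / len(ane_power_mw)

# Energy = power x time; milliwatts x seconds gives millijoules.
# Assumes each figure is the average power over its sampling period.
energy_mj = sum(p * sample_period_s for p in ane_power_mw)

print(f"peak {peak_mw} mW, mean {mean_mw:.1f} mW, energy {energy_mj:.0f} mJ")
# prints "peak 49 mW, mean 33.7 mW, energy 101 mJ"
```

On that assumption, the whole of the first phase of VLU cost the ANE around a tenth of a joule.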

Conclusions

Visual Look Up on M1 series Macs consistently uses the Apple Neural Engine.
VLU’s call chain is through mediaanalysisd, which uses Espresso to manage the ANE. Espresso is also used on Intel Macs to manage neural networks run on their CPU cores.
Maximum power drawn by the ANE was 49 mW, which is low even in comparison to that required by E cores.
The ANE reached high peak (memory) read and write rates of 232 and 167 MB/s during use.
BNNS functions may be run entirely on CPU cores, even when they appear suitable for the ANE, and the ANE is available.
powermetrics is the tool of choice for observing ANE activity; indeed, it appears to be the only tool.

7 Comments


1

RobinDumontChaponet on March 30, 2022 at 9:53 pm

Hi. Doesn’t Final Cut Pro X video analysis use the ANE? (Like the subject-centering auto framing)

- 2
  
  hoakley on March 30, 2022 at 10:01 pm
  
  I’m sorry, I don’t know. It should be straightforward to find out using powermetrics.
  Howard.
  
  - 3
    
    RobinDumontChaponet on March 30, 2022 at 10:17 pm
    
    No problem. Simple curiosity over which programs can access it. …I don’t even own an M1 machine.
    
    (I do own an eGPU that, since macOS 12, I cannot eject without an alert mentioning that mediaanalysisd is still using the GPU. But that is another, although somewhat related, subject.)
    
    Thank you for your response.
    
4

mot on August 30, 2022 at 1:52 am

The Neural Engine is only used to do inference, i.e., feed inputs into a neural network and get outputs. It isn’t used for training (learning). It probably doesn’t have enough floating point precision. The GPU should be used to accelerate training.

I wrote a program that uses Core ML to do inference repeatedly (with a medium-size neural network). Running this program on my M1 Pro, the ANE’s power usage is 2 watts. I think that may be the maximum amount of power it can use.

- 5
  
  hoakley on August 30, 2022 at 5:44 am
  
  Thank you. That’s still quite a significant amount of power, around two P cores at full pelt.
  I don’t suppose you’ve made that app or code available, please? One tool we don’t seem to have at the moment is anything to compare the performance of ANEs in different chips.
  Howard.
  
  - 6
    
    mot on August 30, 2022 at 3:23 pm
    
    Sorry, I’m doing this for a private project. I agree that it would be nice to have a benchmark for the Neural Engine though. Maybe someday.
    
    BTW, one P-core under load uses almost exactly 3.5 watts. At least on my M1 Pro, according to powermetrics.
    
    - 7
      
      hoakley on August 30, 2022 at 6:40 pm
      
      No problem. We can debate the power another time, perhaps 🙂 Either way, in core terms that’s quite a pull.
      Howard.
      
