Armed with my new version of Mints, I’ve now started to look in detail at how Monterey’s Visual Look Up works. In this article I step through its main processes with the aid of its copious log entries, on a T2 Intel Mac and on an M1 Mac, for useful comparison. For this purpose, I’m looking up a single image of Leonardo da Vinci’s Mona Lisa in Mints’ Visual Look Up window.
Visual Look Up (VLU) starts slightly differently depending on whether the image is being viewed in Safari (or a WKWebView in another app like Mints), or in Preview or Photos. In the latter two apps, VLU starts shortly after an image is opened, and completes when the white dot is clicked to display the results. Browser windows start performing their analysis when the contextual menu is opened, proceed apace when the Look Up command is selected, and automatically ‘open’ the white dot to display the results and complete their task.
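For an app developer, displaying an image in a WKWebView is enough to obtain this browser-style behaviour. The sketch below isn’t Mints’ actual code, just a minimal illustration using public AppKit and WebKit APIs:

```swift
import Cocoa
import WebKit

// A minimal sketch (not Mints' actual implementation): display an image in a WKWebView.
// In Monterey, an image shown this way gets the same contextual-menu Look Up command
// as it does in Safari, which is what starts the analysis described below.
final class LookUpViewController: NSViewController {
    private let webView = WKWebView(frame: .zero, configuration: WKWebViewConfiguration())

    override func loadView() {
        view = webView
    }

    /// Show a local image file; its directory is granted read access so the file can load.
    func show(imageURL: URL) {
        webView.loadFileURL(imageURL,
                            allowingReadAccessTo: imageURL.deletingLastPathComponent())
    }
}
```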
While VLU is taking place, the image being looked up is shown in a floating QuickLook preview window.
When looking up images of paintings, VLU works in two phases. In the first, the image is analysed and classified, and any objects within it are detected, in what’s termed a VisionKit Analyzer process; its completion is reported by the appearance of one or more white dots on the image. The second phase is visual search, in which the NeuralHash or perceptual hash(es) obtained during analysis are sent to Apple’s servers, and the best-matching results are returned for display as information about the image, or objects within it.
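To give a flavour of what a perceptual hash is, here’s a toy ‘average hash’ written in Swift. It bears no relation to Apple’s NeuralHash, which is produced by a neural network, but it illustrates the general idea: reduce an image to a compact fingerprint which changes little when the image is resized or recompressed.

```swift
import CoreGraphics

// A toy "average hash", NOT Apple's NeuralHash: it reduces an image to a 64-bit
// fingerprint which is largely unchanged by resizing or recompression.
func averageHash(of image: CGImage) -> UInt64? {
    let side = 8
    // Downscale the image to an 8 x 8 greyscale bitmap.
    guard let context = CGContext(data: nil,
                                  width: side, height: side,
                                  bitsPerComponent: 8, bytesPerRow: 0,
                                  space: CGColorSpaceCreateDeviceGray(),
                                  bitmapInfo: CGImageAlphaInfo.none.rawValue)
    else { return nil }
    context.interpolationQuality = .high
    context.draw(image, in: CGRect(x: 0, y: 0, width: side, height: side))
    guard let data = context.data?.assumingMemoryBound(to: UInt8.self) else { return nil }

    // Gather the 64 pixel values, respecting the context's row stride.
    let rowBytes = context.bytesPerRow
    var values = [Int]()
    for y in 0..<side {
        for x in 0..<side {
            values.append(Int(data[y * rowBytes + x]))
        }
    }

    // Each bit of the hash records whether a pixel is brighter than the mean.
    let mean = values.reduce(0, +) / values.count
    var hash: UInt64 = 0
    for (index, value) in values.enumerated() where value > mean {
        hash |= 1 << UInt64(index)
    }
    return hash
}
```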
Analysis
VisionKit begins an Analyzer Process Request event, which adds the request to the “Mad” (mediaanalysisd) interface and starts processing it. VisionKit then submits a MAD Parse request, with both the VisionKit and MAD components being assigned ID numbers which can be used to track them. mediaanalysisd receives an on-demand image processing request with a PixelBuffer containing the image, or part of it, for analysis, increases its CPU limit for the next 24 hours, and schedules a foreground task to handle it. Although individual events are too brief to track this easily, I believe that M1 series Macs run these threads on their Performance cores.
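The PixelBuffer referred to in the log is a CVPixelBuffer. For illustration only, as VisionKit’s own code path here is private, this is how a CGImage can be wrapped in such a buffer using public APIs:

```swift
import CoreVideo
import CoreGraphics

// A sketch of wrapping a CGImage in a CVPixelBuffer, the kind of buffer which the log
// shows being handed to mediaanalysisd; an illustration, not VisionKit's own code.
func makePixelBuffer(from image: CGImage) -> CVPixelBuffer? {
    var pixelBuffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault,
                                     image.width, image.height,
                                     kCVPixelFormatType_32BGRA,
                                     nil, &pixelBuffer)
    guard status == kCVReturnSuccess, let buffer = pixelBuffer else { return nil }

    CVPixelBufferLockBaseAddress(buffer, [])
    defer { CVPixelBufferUnlockBaseAddress(buffer, []) }

    // Draw the image into the buffer's backing memory as BGRA.
    guard let context = CGContext(data: CVPixelBufferGetBaseAddress(buffer),
                                  width: image.width, height: image.height,
                                  bitsPerComponent: 8,
                                  bytesPerRow: CVPixelBufferGetBytesPerRow(buffer),
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.premultipliedFirst.rawValue
                                      | CGBitmapInfo.byteOrder32Little.rawValue)
    else { return nil }
    context.draw(image, in: CGRect(x: 0, y: 0, width: image.width, height: image.height))
    return buffer
}
```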
A Visual Search Gating Task is then started, as is a TRI Client, which checks entitlements for this process. mediaanalysisd looks for “factors” in a “treatment” using a naive cache, and performs a “coarse classification”. During this, Espresso tries different contexts and may add (neural) networks to be used for analysis.
One significant difference between Intel and M1 Macs is in how these neural networks are run: the M1 uses its Apple Neural Engine (ANE), whereas Intel Macs lack that hardware support and rely on CoreML calls instead. This results in large differences in the time taken, with the M1 typically completing this work in less than half the time of an Intel Mac.
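The public face of that choice is CoreML’s MLModelConfiguration. With its computeUnits set to .all, CoreML is free to schedule work on the ANE where one exists, and falls back to the CPU and GPU where it doesn’t. The model path below is hypothetical, purely for illustration:

```swift
import Foundation
import CoreML

// Choosing which hardware runs a Core ML model. With .all, work may be scheduled on the
// Apple Neural Engine where present; on an Intel Mac it falls back to the CPU and GPU.
let configuration = MLModelConfiguration()
configuration.computeUnits = .all        // .cpuAndGPU or .cpuOnly would exclude the ANE

// Hypothetical compiled model, purely for illustration.
let modelURL = URL(fileURLWithPath: "/path/to/SomeClassifier.mlmodelc")
let model = try? MLModel(contentsOf: modelURL, configuration: configuration)
```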
Coarse classification and object detection are performed accordingly, involving further Espresso sessions, and there may be annotation extraction too.
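Those models are private, but the public Vision framework performs comparable classification and object detection, which gives a feel for what this phase involves. This sketch is only an analogue, not the code path VLU itself runs:

```swift
import Vision
import CoreGraphics

// A public-API analogue of this phase: classify an image and find salient objects.
// VLU itself uses private models in mediaanalysisd, not these Vision requests.
func analyse(_ image: CGImage) throws {
    let classify = VNClassifyImageRequest()
    let saliency = VNGenerateObjectnessBasedSaliencyImageRequest()

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([classify, saliency])

    // The most confident labels, a rough parallel to "coarse classification".
    for observation in (classify.results ?? []).prefix(5) {
        print(observation.identifier, observation.confidence)
    }
    // Bounding boxes of detected objects, a rough parallel to object detection.
    for object in saliency.results?.first?.salientObjects ?? [] {
        print("object at", object.boundingBox)
    }
}
```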
At the end of these iterations, the Visual Search Gating Task is declared complete, as is the VisionKit MAD Parse. Total processing times reported are typically in the range of 100-500 ms. These steps may be repeated, either for a Document Recognition Task or for another Visual Search Gating Task, until image analysis is complete.
Visual Search
Once the white dot has appeared and the user has clicked on it (or that has happened automatically), search starts with VisionKit submitting a VisualSearch request to MAD. Once CPU limits have been increased again for a period of 24 hours, mediaanalysisd runs a Visual Search Task using client-provided “OCR results” from the previous analysis phase. In the case of a painting, after Search E2E is recorded, this is declared to be knowledgeSearch.art.
Espresso loads a neural network, on an M1 Mac using its H11 ANE, or using CoreML on an Intel Mac. PegasusKit then sets up the search query, and mediaanalysisd establishes a TLS 1.3 connection with Apple’s servers over port 443. It appears that the Mac sends the server the NeuralHash(es) obtained during analysis, for the server to match against its database of NeuralHashes.
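For comparison, this is how a TLS 1.3 connection over port 443 can be made using the public Network framework. The host name is hypothetical, and mediaanalysisd makes its query through PegasusKit rather than through code like this:

```swift
import Network

// A sketch of a TLS 1.3 connection over port 443 using the Network framework.
// The host name is hypothetical; the real search goes through PegasusKit.
let tlsOptions = NWProtocolTLS.Options()
sec_protocol_options_set_min_tls_protocol_version(tlsOptions.securityProtocolOptions,
                                                  .TLSv13)
let parameters = NWParameters(tls: tlsOptions)

let connection = NWConnection(host: "vlu.example.com",   // hypothetical host
                              port: 443,
                              using: parameters)
connection.stateUpdateHandler = { state in
    print("connection state:", state)
}
connection.start(queue: .main)
```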
PegasusKit publishes the successful RPC response from the servers, and the Visual Search task is complete. VisionKit processes the results, which are then displayed in the VLU floating window.
These processes are summarised in the diagram below. Here’s a free tear-out PDF to take away: VisualLookUp1
This is a simple example, using a painting which doesn’t contain any nested images which require further analysis. Later this week I’ll look at how VLU handles other types of image, including photos containing pets, flowers and landmarks. Comparison against Apple’s published description of its proposed CSAM detection scheme is given here.