How Visual Look Up works in detail 2: Object recognition and Live Text

In my first article looking in detail at how Visual Look Up works, I considered a painting, which resulted in a search type of knowledgeSearch.art. As others have pointed out, recognising an image of a painting isn’t entirely novel, as that’s something already available using Google Image Search, although that’s less convenient and less informative than using Visual Look Up. What isn’t readily available elsewhere is identifying the breed of a dog, the species of a flower, or a well-known landmark. This article examines how those are performed in Visual Look Up (VLU), and how they contrast with Live Text.

VLU operates the same way when working on photographic images as on those of paintings. In Preview or Photos, image analysis starts shortly after opening the image; in Safari or a WKWebView, it starts when the user opens the contextual menu for the image by Control-clicking it. These initiate the first of the two phases, in which the image is analysed and classified, and objects within it are detected and subjected to similar analysis.
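
Although those two phases are implemented privately in mediaanalysisd, Apple exposes what appears to be the same two-phase analysis through VisionKit’s ImageAnalyzer in macOS 13 and later. The following is only a minimal sketch of the first, analysis phase using that public API; the image passed in is assumed to be an NSImage already loaded from disk, and whether this exercises exactly the same code paths as Preview or Photos is an assumption.

```swift
import AppKit
import ImageIO
import VisionKit

// A sketch of the analysis phase using the public VisionKit API (macOS 13+),
// rather than anything private to mediaanalysisd.
@MainActor
func analyse(_ image: NSImage) async {
    let analyzer = ImageAnalyzer()
    // Ask for text recognition and Visual Look Up in a single pass.
    let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
    do {
        let analysis = try await analyzer.analyze(image, orientation: .up, configuration: configuration)
        // hasResults(for:) reports whether this phase found anything worth
        // offering a white dot (Visual Look Up) or selectable text for.
        print("Visual Look Up results:", analysis.hasResults(for: .visualLookUp))
        print("Recognised text:", analysis.transcript)
    } catch {
        print("Image analysis failed: \(error)")
    }
}
```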

Once those are complete and the white dot is shown on the recognised object, Visual Search can be performed, either automatically or by clicking the white dot. VLU then sends Apple’s servers the NeuralHash(es) for the object(s) identified, and the search results returned are displayed in a floating window. In addition to the search category knowledgeSearch.art, others used include knowledgeSearch.nature (flowers), knowledgeSearch.landmark (landmarks) and pets (pets).

Analysis

The same Analyzer and MAD Parse processes start the analysis phase, with a Visual Search Gating Task and entitlement check. Iterative use of neural networks is performed using the same method, which on an M1 Mac with its Apple Neural Engine (ANE) starts with discovery of the ANE and its services. mediaanalysisd then declares search purposes, such as coarseClassification or objectDetection, or categories. Espresso creates and destroys contexts, creates a plan, and loads the network. When those have been completed, search categories are declared for the search to be performed.
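
All of those steps are recorded in the unified log, which is where the names above come from. Below is a minimal sketch of pulling the relevant entries programmatically; it assumes it’s run with root privileges, as OSLogStore.local() is otherwise refused by macOS, and the five-minute window and process-name predicate are simply illustrative choices.

```swift
import Foundation
import OSLog

// A sketch of retrieving mediaanalysisd's recent log entries, so that the
// declared search purposes and categories can be read directly.
// OSLogStore.local() needs root (or the private logging entitlement);
// the same entries can be browsed with `log show` in Terminal.
func dumpMediaAnalysisLog() {
    do {
        let store = try OSLogStore.local()
        let start = store.position(date: Date().addingTimeInterval(-300)) // last 5 minutes
        let predicate = NSPredicate(format: "process == %@", "mediaanalysisd")
        for entry in try store.getEntries(at: start, matching: predicate) {
            print(entry.date, entry.composedMessage)
        }
    } catch {
        print("Log query failed: \(error)")
    }
}
```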

The ANE is referred to in several log entries as an H11; I’m grateful to @jankais3r for pointing out that this may be misleading. The H11 was the ANE included in the A12 chip, and it’s thought that the model in M1 series chips is the H13 instead, which appeared in the A14 as well. It may well be that the macOS API continues to refer to the ANE as H11ANE, even though it’s now the H13.

Visual Search

The same white dot is used to indicate that analysis is complete and VLU is ready to search the object(s) it has identified and generated NeuralHashes for. Where multiple objects have been recognised, multiple white buttons are shown, one located at the centre of each object. It’s early during this phase that mediaanalysisd declares the search categories.
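
For an app adopting VisionKit, those buttons come from ImageAnalysisOverlayView (macOS 13 and later), which draws them over each recognised object once it’s handed the completed analysis. A brief sketch follows, assuming an existing NSImageView named imageView and the analysis object produced by the earlier sketch.

```swift
import AppKit
import VisionKit

// A sketch of the second phase as a host app sees it: attach VisionKit's
// overlay to an image view so the white dot button(s) appear over recognised
// objects, and clicking one starts the Visual Search described below.
@MainActor
func attachOverlay(to imageView: NSImageView, with analysis: ImageAnalysis) {
    let overlayView = ImageAnalysisOverlayView()
    overlayView.frame = imageView.bounds
    overlayView.autoresizingMask = [.width, .height]
    overlayView.trackingImageView = imageView      // keeps the overlay aligned with the displayed image
    overlayView.preferredInteractionTypes = [.automatic] // text selection plus Visual Look Up
    overlayView.analysis = analysis
    imageView.addSubview(overlayView)
}
```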

As VLU works currently, objects of categories other than knowledgeSearch.art can’t be recognised within an image which has been assigned the category knowledgeSearch.art. This allows VLU to identify paintings within a painting, but not dogs, flowers or landmarks within paintings. The latter three categories only appear to be recognised when the whole image isn’t categorised as knowledgeSearch.art, that is, in a photographic image.

An identical TLS 1.3 connection over port 443 is then used to send the search queries to Apple servers, and the result is processed and displayed in the corresponding VLU floating window.

Live Text

Triggering Live Text recognition is quite different from VLU. It appears to occur when the pointer is placed over the part of the image containing text which can be recognised. Although the same Analyzer and MAD Parse processes are started, a Document Recognition Task is run instead of a Visual Search Gating Task. This doesn’t require any entitlements, and after loading a couple of neural networks, mediaanalysisd creates a Composite Language Model and queries Linguistic Data. MRC Parsing is completed quickly, which ends the MAD Parsing and calls a completion handler.
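
Text recognition isn’t confined to mediaanalysisd: the Vision framework exposes it publicly as VNRecognizeTextRequest, and although it’s only an assumption that Live Text uses exactly the same models, the public request gives a useful point of comparison. A minimal sketch, assuming an image file at imageURL:

```swift
import Foundation
import Vision

// A sketch of on-device text recognition through the documented Vision API.
// The .accurate recognition level and language correction broadly correspond
// to the language modelling and Linguistic Data queries seen in the log,
// although that correspondence is an assumption.
func recogniseText(at imageURL: URL) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate   // slower, language-model assisted
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])

    // One best candidate string per block of recognised text.
    let observations = request.results ?? []
    return observations.compactMap { $0.topCandidates(1).first?.string }
}
```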

Live Text recognition can occur whenever it’s considered that an image may contain text. For example, an area containing a regular pattern which could conceivably be text can trigger it in an image which also contains a recognisable pet.

As far as I can see, these only confirm my original graphical summary:
[Diagram: VisualLookUp1, also available as its PDF: VisualLookUp1]

I’ll next be looking at what can prevent VLU from working.