Last Week on My Mac: The magic is how not what

Compared with Visual Look Up, Live Text appears straightforward: open an image containing recognisable text, macOS runs OCR on it, and lets you select what it succeeds in recognising. But when I came to read the log to confirm how this worked, that wasn’t what happened at all.

Fortunately, following my deep dive into what happens in Visual Look Up, I already had the tool to do the same for Live Text, in my utility Mints. So I opened an image in Preview at exactly 13:38:00, waited ten seconds for OCR to complete, then moved the pointer over some of the text and selected it. Armed with the filtered log content from Mints, and a complete record from Ulbow, I set about tracing how VisionKit and mediaanalysisd recognised the text and made it available to the user.

What I couldn’t understand, though, was how the log recorded in great detail how the image had been opened in the moments after 13:38:00, but VisionKit didn’t write anything to the log for ten seconds, at 13:38:10, when I had moved the pointer over the text and it changed into an I-beam ready for me to select what I wanted to copy or process.

Obviously, I had made a mistake somewhere, so I repeated the test and got exactly the same result. I was so disbelieving that, after I had written my account of this, I thought I had better run a third test just in case the first two were somehow flawed. The result was the same.

It took a day after the article was published here before someone pointed out to me how I was completely wrong. Ben wrote:
“When you open an image, clearly Live Text doesn’t wait until you move the mouse pointer to begin. If it did then we would see continuous text detection processes churning the CPU and/or ML cores while moving the mouse cursor over image views.”

So I’ve gone back and repeated my test, this time using the WKWebView built into Mints, rather than Preview.

This time, I opened the window in Mints a smidgin before 20:58:00. I then scrolled to the end of the page, past the Live Text sample, at 20:58:05, scrolled back to the sample at 20:58:10 and paused there to allow time for OCR on its text to complete. At 20:58:25 I moved the pointer over the text area, saw it change from the standard arrow to an I-beam, and after 20:58:30 selected the text that had been recognised.

I then had two log extracts to work with: the filtered entries from Mints, and a whole-log record, both starting from 20:58:00 and running for forty seconds until the recognised text had been selected. If you want to read a fuller summary of those entries, I provide them in an appendix at the end. Here I’ll just quote the most important to this account.

At 20:58:00, the log is full of the lengthy exchanges required to connect to this blog and obtain its test page. Those run from
20:58:00.003 CFNetwork Connection 1: TLS handshake complete
down to
20:58:01.622 Mints SafariSafeBrowsing Service did finish database update with protection type 1 and provider 1
by which time the WKWebView was open and displaying the top of the test page.

After that, there’s not one entry from VisionKit or any other subsystem related to Live Text or Visual Look Up until
20:58:26.428 VisionKit : Device Supports Analysis: YES
following which there’s a detailed account of the OCR taking place, including linguistic analysis, concluding just over half a second later with
20:58:27.093 VisionKit : END "VisionKit MAD Parse Request"
20:58:27.093 VisionKit : Request completed: <private>
20:58:27.093 VisionKit : Calling completion handler For Request ID:1
Total Processing Time 664.27ms
Has Analysis: YES
request: <private>
Error: (null)

Whatever we might like to think happens in Live Text, it’s clear that text recognition isn’t triggered when an image is loaded into a window, or scrolled into view. Think about this for a few moments, and that would prove disastrous when loading a page or document containing fifty embedded images each with recognisable text, or scrolling through that. Instead, this on user demand, and just in time, when you go to access the text that hasn’t yet been recognised at all.

Look back at my account of how Visual Look Up is triggered, where I wrote:
“Visual Look Up (VLU) starts slightly differently depending on whether the image is being viewed in Safari (or a WKWebView in another app like Mints), or in Preview or Photos. In the latter two apps, VLU starts shortly after an image is opened, and completes when the white dot is clicked to display the results. Browser windows start performing their analysis when the contextual menu is opened, proceed apace when the Look Up command is selected, and automatically ‘open’ the white dot to display the results and complete their task.”

What’s magic about Visual Look Up and Live Text isn’t what they do, but how they do it. There are plenty of apps offering OCR, and services to provide information about recognisable objects in images. While both macOS services benefit from using the hardware neural engine in M1 series chips and iOS/iPadOS devices, their major advantage is availability and ease of access, to the point where most users are unaware of what’s going on. What you then read in the log doesn’t appear to make any sense at all.

Appendix: Log Extracts

Mints’ Visual Look Up window (WKWebView) opened from the Window menu just before 00 seconds on an otherwise idle Mac Studio with M1 Max chip. All times in seconds.

First this makes the connection to load the page:
00.003 CFNetwork Connection 1: TLS Trust result 0
00.003 CFNetwork Connection 1: TLS handshake complete

down to
01.622 Mints SafariSafeBrowsing Service did finish database update with protection type 1 and provider 1

Following that, the first log entry from any of the sub-systems involved in Live Text, marking the start of text recognition:
26.428 VisionKit : Device Supports Analysis: YES
26.429 VisionKit : BEGIN "VKImageAnalyzerProcessRequestEvent"
26.429 VisionKit : BEGIN "VisionKit MAD Parse Request"

Following those, Espresso, responsible for managing neural networks, starts up:
26.517 Espresso espresso : Creating context 5577639120 eng=10007 dev=-1
26.517 Espresso espresso : Destroying context 5577639120
26.517 mediaanld VCPMADVIDocumentRecognitionTask running...

26.542 Espresso espresso : Loaded network: '<private>' pf=10007 cp=1
then the ANE (Apple Neural Engine) is opened:
26.582 mediaanld (Single-ANE System with LB) Opening H11ANE device
26.582 mediaanld H11ANEDevice::H11ANEDeviceOpen, usage type: 1
26.582 mediaanld H11ANE Device Open succeeded with usage type: 1
26.582 mediaanld (Single-ANE System with LB) Selected H11ANE device
26.583 mediaanld client : Daemon status changed from 0 to 1

Following recognition of the text, language models are engaged:
26.859 mediaanld CompositeLanguageModel : Creating CompositeLanguageModel (<private>): <private>
26.859 mediaanld CompositeLanguageModel : loading resource id 0 type 7 with path <private>
26.861 mediaanld Default : NgramModel: Loaded language model: <private>
26.861 mediaanld LanguageModelingSignposts : __##__signpost.description#____#begin_time#_##_#42035910138##__##
26.861 mediaanld LanguageModelingSignposts : <private>

The text is recognised as being in English:
26.915 mediaanld Notice : -[MAAssetQuery getResultsFromMessage:]: Got the query meta data reply for:, response: 0
26.915 mediaanld Query : assetsForQuery: <query:, locO: 1, iO: 1, latO: 1, <filter: {
AssetLocale = "{(\n en\n)}";
AssetType = "{(\n Delta\n)}";
}>> final result: (
) was cached: 0, cachedOnly: 0

This progresses with:
26.963 mediaanld CompositeLanguageModel : Creating CompositeLanguageModel (<private>): <private>
26.963 mediaanld CompositeLanguageModel : loading resource id 0 type 0 with path <private>
26.963 mediaanld CompositeLanguageModel : loading resource id 1 type 42 with path <private>
26.963 mediaanld CompositeLanguageModel : loading resource id 2 type 40 with path <private>

and so on for more resources.

Shortly afterwards, the text recognition task is complete:
27.070 mediaanld VCPMADVIDocumentRecognitionTask complete

This feeds up to VisionKit:
27.087 VisionKit : Completed MRC Parsing of 0 elements in 0.000001 seconds.
and concludes with
27.093 VisionKit : END "VisionKit MAD Parse Request"
27.093 VisionKit : Request completed: <private>
27.093 VisionKit : Calling completion handler For Request ID:1
Total Processing Time 664.27ms
Has Analysis: YES
request: <private>
Error: (null)

The whole OCR task here took 0.665 seconds by log entries, matching that given. Typical processing times reported on test images range between 0.4-0.8 seconds, using the ANE.