Inside Live Text

Live Text is a feature available in Monterey on all Macs officially supported by that version of macOS. It’s not yet available for all languages and scripts: Apple’s official list includes multiple versions of Cantonese, Chinese, English, French, German, Italian, Portuguese and Spanish, but doesn’t yet mention Japanese, although that already appears to work to a degree.

Apart from its being confined to certain apps, including Safari, Preview and Photos, Apple doesn’t mention any requirements or limitations. Although not mentioned by Apple, it also works in WKWebView views (part of the WebKit framework), which may be used by third-party apps. To help you test this, I have now added a sample image to my Visual Look Up test page, readily accessed using my free utility Mints.

If you want to use a similar facility on any text shown on your display, regardless of which app it’s being displayed in, then TextSniper from the App Store should prove ideal.

Although it may appear related to Visual Look Up, which recognises images and objects within them such as dog and cat breeds, and many paintings, Live Text uses a different mechanism to recognise text in images. This doesn’t rely on any information being sent from your Mac anywhere else, nor on Siri or any other part of macOS. Neither do you need to access contextual menus or open Information windows: Live Text just works, and lets you select the text it has already recognised, as if by magic.

What happens in Live Text

When you open an image containing text which Live Text could recognise, macOS doesn’t immediately try to recognise any text, but appears to segment the image into areas which it thinks contain recognisable text.
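
Live Text’s own segmentation isn’t exposed to third-party code, but the Vision framework offers a public request which does a similar job, finding rectangles likely to contain text without recognising any characters. Here’s a minimal sketch of that public analogue in Swift; the image path is hypothetical, and this isn’t what VisionKit itself calls:

import Foundation
import AppKit
import Vision

// Load a test image (hypothetical path) and obtain a CGImage for Vision.
guard let image = NSImage(contentsOfFile: "/tmp/sample.png"),
      let cgImage = image.cgImage(forProposedRect: nil, context: nil, hints: nil) else {
    fatalError("Couldn't load the test image")
}

// Detect areas which probably contain text, without recognising the characters.
let request = VNDetectTextRectanglesRequest { request, _ in
    guard let observations = request.results as? [VNTextObservation] else { return }
    for observation in observations {
        // boundingBox is in normalised coordinates, origin at lower left.
        print("Candidate text area:", observation.boundingBox)
    }
}

do {
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
} catch {
    print("Detection failed:", error)
}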

The feature is triggered when the pointer passes over an area suspected to contain text. This initiates text recognition and changes the pointer to an I-beam. First, VisionKit determines whether the device supports analysis. If it does, it then begins a request for an Image Analyzer Process, and parsing of that image area by mediaanalysisd.
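
Neither the Image Analyzer service nor mediaanalysisd is available to third-party code in Monterey, but Vision’s public VNRecognizeTextRequest performs the same kind of recognition and gives a feel for what happens behind the I-beam. A sketch, again with a hypothetical image path:

import Foundation
import AppKit
import Vision

guard let image = NSImage(contentsOfFile: "/tmp/sample.png"),
      let cgImage = image.cgImage(forProposedRect: nil, context: nil, hints: nil) else {
    fatalError("Couldn't load the test image")
}

let request = VNRecognizeTextRequest { request, _ in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    for observation in observations {
        // Take the single best candidate string for each recognised block.
        if let best = observation.topCandidates(1).first {
            print(best.string, "(confidence \(best.confidence))")
        }
    }
}
request.recognitionLevel = .accurate      // the slower, neural-network-based path
request.usesLanguageCorrection = true     // apply linguistic correction to the raw output

do {
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
} catch {
    print("Recognition failed:", error)
}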

As with Visual Look Up, analysis is performed using neural networks. Where an Apple Neural Engine (ANE) is available (currently only in Apple’s M1 series chips), the work runs there; otherwise it’s run on the CPU. In either case it’s managed by Espresso, which is responsible for this machine learning support.
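
Espresso itself is private, but the public Core ML API shows how that choice of compute hardware is expressed: a model’s configuration declares which compute units it may use, and Core ML then schedules its networks on the ANE, GPU or CPU as available. A sketch, assuming a compiled model at a hypothetical path:

import Foundation
import CoreML

// Declare which compute units Core ML may use for this model's networks.
let config = MLModelConfiguration()
config.computeUnits = .all         // ANE, GPU and CPU, as available
// config.computeUnits = .cpuOnly  // restrict to the CPU for comparison

// Hypothetical compiled model; the models used by Live Text are private to macOS.
let modelURL = URL(fileURLWithPath: "/tmp/TextRecognizer.mlmodelc")
do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    print("Loaded:", model.modelDescription)
} catch {
    print("Couldn't load the model:", error)
}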

After a few runs of neural networks, mediaanalysisd creates a Composite Language Model of the text, capable of handling multiple languages. The underlying language(s) are recognised using linguistic data within macOS. This is apparently performed for each block of text recognised in the image, until the Document Recognition task is completed.
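
That Composite Language Model is internal to mediaanalysisd, but the linguistic data it draws on appears to be of the same kind exposed through the NaturalLanguage framework, which can identify the dominant language(s) of a piece of recognised text. For example, with a made-up snippet standing in for recognised German:

import NaturalLanguage

// A made-up snippet standing in for text recognised from an image.
let recognised = "Das ist ein kurzer Absatz erkannten Textes."

let recognizer = NLLanguageRecognizer()
recognizer.processString(recognised)

if let language = recognizer.dominantLanguage {
    print("Dominant language:", language.rawValue)    // "de"
}
// Probabilities for the most likely candidate languages.
for (language, probability) in recognizer.languageHypotheses(withMaximum: 3) {
    print(language.rawValue, probability)
}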

With the recognised text parsed and ready for the user to select, VisionKit declares the media analysis complete, and writes the time taken to the log. For readily recognised English, this typically takes around 400-800 ms. Once analysis is complete, the recognised text is available for the user to select.
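
Those timings are written by VisionKit for its own analysis, so they can’t be reproduced exactly from outside, but it’s easy to time Vision’s public recogniser on the same image for a rough comparison. A sketch, again with a hypothetical image path:

import Foundation
import AppKit
import Vision

guard let image = NSImage(contentsOfFile: "/tmp/sample.png"),
      let cgImage = image.cgImage(forProposedRect: nil, context: nil, hints: nil) else {
    fatalError("Couldn't load the test image")
}

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate

// Time the recognition pass and report it in milliseconds.
let start = CFAbsoluteTimeGetCurrent()
do {
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
} catch {
    print("Recognition failed:", error)
}
let elapsed = (CFAbsoluteTimeGetCurrent() - start) * 1000
print(String(format: "Recognition took %.0f ms", elapsed))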

When you have selected some of the text recognised in the image, you can then copy it, or Control-click on selected text to access other services such as translation and Look Up. Live Text works with short snippets such as phone numbers, and with whole windows. When recognising Latin/Roman characters, it also works with projected and vertical arrangements of characters. On a good day it can even extract text which you may find difficult to decipher.
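
Pulling actionable items such as phone numbers out of recognised text is a further step. Live Text’s own data detection is internal, but Foundation’s NSDataDetector does the same kind of work for third-party code; this sketch uses a made-up string in place of recognised text:

import Foundation

// A made-up string standing in for text recognised from an image.
let recognised = "Call our office on +44 20 7946 0958 before Friday."

guard let detector = try? NSDataDetector(types: NSTextCheckingResult.CheckingType.phoneNumber.rawValue) else {
    fatalError("Couldn't create the data detector")
}

let range = NSRange(recognised.startIndex..., in: recognised)
for match in detector.matches(in: recognised, options: [], range: range) {
    if let number = match.phoneNumber {
        print("Phone number found:", number)
    }
}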

Live Text is an excellent example of a simple but powerful tool which relies on the machine learning features built into macOS, and is accelerated when used on M1 Macs with their ANE hardware.