There’s a clear link between Monterey’s Visual Look Up (VLU) feature and Apple’s proposals last summer to introduce the detection of Child Sexual Abuse Material (CSAM) in images. Don’t take that to mean that VLU is the thin end of the CSAM wedge, though.
In its unsuccessful efforts to persuade us that its CSAM detection methods were safe, Apple published documents explaining how it intended to compare images using perceptual hashing, in a function it terms a NeuralHash. This maps each image to a number, which can then be used to search a database of NeuralHashes.
This starts with the computation of image descriptors which characterise an image, constructed so that images which are perceptually and semantically similar have descriptors that are close to one another. Those real-valued vectors are then converted into integer hash values using a Hyperplane Locality-Sensitive Hashing (LSH) process, so that similar images produce the same hash while distinct images produce different ones.
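To make that concrete, here’s a minimal sketch of hyperplane LSH in Python, with random vectors standing in for real image descriptors. This isn’t Apple’s implementation, just the general technique: each random hyperplane splits descriptor space in two, and the pattern of sides a descriptor falls on becomes the bits of an integer hash.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A toy stand-in for descriptor space; the real NeuralHash
# descriptors come from Apple's neural network.
DIM = 128
N_PLANES = 16  # number of random hyperplanes = bits in the hash

# Each row is the normal vector of one random hyperplane.
planes = rng.standard_normal((N_PLANES, DIM))

def lsh_hash(descriptor: np.ndarray) -> int:
    """Map a real-valued descriptor to an integer hash.

    Each bit records which side of one hyperplane the descriptor
    lies on, so descriptors that are close together fall on the
    same side of most planes and usually share a hash.
    """
    bits = planes @ descriptor > 0
    return int(sum(1 << i for i, b in enumerate(bits) if b))

# Two nearby descriptors (original and a slightly perturbed copy)
# almost always hash alike; an unrelated descriptor almost never does.
original = rng.standard_normal(DIM)
perturbed = original + 0.01 * rng.standard_normal(DIM)
unrelated = rng.standard_normal(DIM)

print(lsh_hash(original) == lsh_hash(perturbed))   # very likely True
print(lsh_hash(original) == lsh_hash(unrelated))   # very likely False
```

The resulting integer is what gets looked up in the database of known hashes.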
Those image descriptors are generated by a neural network trained through a self-supervised scheme, in which the network learns to generate close descriptors for pairs of an original image and a perturbed copy intended to remain perceptually identical, and distant descriptors for pairs of unrelated images.
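Apple hasn’t published its exact training objective, but a standard contrastive (margin) loss captures the idea: one term per pair of descriptors, minimised when matching pairs are close and unrelated pairs are pushed beyond a margin. A toy sketch:

```python
import numpy as np

def contrastive_loss(desc_a: np.ndarray, desc_b: np.ndarray,
                     is_same_image: bool, margin: float = 1.0) -> float:
    """One term of a self-supervised contrastive objective.

    For a pair built from an original image and a perturbed copy
    (is_same_image=True), the loss falls as the descriptors get
    closer; for a pair of unrelated images, it falls as they move
    apart, until they are separated by at least the margin.
    """
    d = np.linalg.norm(desc_a - desc_b)
    if is_same_image:
        return d ** 2                     # pull matching pairs together
    return max(0.0, margin - d) ** 2      # push unrelated pairs apart
```

Training the network to minimise this over many such pairs is what gives the descriptors the closeness property the hashing step depends on.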
In its application to CSAM detection, a special cryptographic protocol (private set intersection, combined with threshold secret sharing) is used to ensure that Apple learns the NeuralHashes only of those images suspected of being CSAM. That appears unnecessary for Visual Look Up, where there shouldn’t be any need for secrecy beyond standard privacy protection.
You can trace VLU at work in the log. Early signs include VisionKit sending a request to its Image Analyzer to process an image, and a request for VisionKit MAD Parsing. mediaanalysisd then performs Media Analysis and calls Espresso, which appears to be responsible for running neural networks. Initial attempts are made at text recognition, which are normally unsuccessful for paintings. PegasusKit is then invoked for the search. Next Argos, a part of Siri, invokes VisualIntelligence for coarse classification and object detection.
Following this, VisionKit starts a VisualSearch request, and Argos announces the search type as knowledgeSearch.art. mediaanalysisd then makes a TLS connection, presumably to Apple’s search service. A successful RPC response from that is published by PegasusKit, completing the VisualSearch query; in one example this took 0.41 seconds.
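If you want to follow the same trail yourself, a short script can pull the relevant entries from the unified log. The predicate below is an assumption built from the process and message names seen above; adjust the terms and the time window to bracket your own test:

```python
import subprocess

# Query the last two minutes of the unified log for entries from
# mediaanalysisd, or mentioning VisualSearch or PegasusKit.
# Perform a Visual Look Up first, then run this within that window.
predicate = (
    'process == "mediaanalysisd" '
    'OR eventMessage CONTAINS[c] "VisualSearch" '
    'OR eventMessage CONTAINS[c] "PegasusKit"'
)
result = subprocess.run(
    ["log", "show", "--last", "2m", "--info", "--predicate", predicate],
    capture_output=True,
    text=True,
)
print(result.stdout)
```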
This coming week, I will be adding support to my utility Mints to extract log entries relating to VLU, to make it easier to troubleshoot and understand.
The evidence is that, at least as far as identification of paintings is concerned, VLU is serendipitous. If I were to devise a strategy for identifying paintings, this approach would only be a part of my overall solution. Inspecting the file name, and any image metadata, can often narrow the search down considerably, and any technique which looked a gift horse like BonnardMartheBath.jpg in the mouth would be needlessly inefficient.
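To illustrate how cheap that gift horse is to accept, here’s a toy sketch (purely hypothetical, and nothing VLU actually does) turning such a filename into candidate search terms before a single pixel is examined:

```python
import re
from pathlib import Path

def search_terms_from_filename(path: str) -> list[str]:
    """Split a camel-case image filename into candidate search terms.

    'BonnardMartheBath.jpg' -> ['Bonnard', 'Marthe', 'Bath'],
    which already narrows the search to Bonnard's paintings of
    Marthe bathing.
    """
    stem = Path(path).stem
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|\d+", stem)

print(search_terms_from_filename("BonnardMartheBath.jpg"))
# ['Bonnard', 'Marthe', 'Bath']
```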
There are other properties commonly seen in images of paintings which are also valuable aids to identification, such as the texture of the painting, choice of palette, level of detail, and of course the presence of an artist’s signature. Even the information returned by VLU for recognised paintings confirms the lack of involvement of an art specialist in this project: the date of creation is labelled “Established”, no media are given, and of the two standard dimensions only the height is given. Even a cursory glance at the captions of the paintings shown in this blog reveals how ungainly VLU is in reporting its results.
It thus looks like Visual Look Up, for paintings at least, does use part of Apple’s technology intended for CSAM detection. While VLU is a wonderful feature, it looks more like a fortuitous accident and a demonstration of what might come elsewhere, not a goal in itself.
Apple could well be using VLU as a test-bed to improve CSAM detection for release in, say, macOS 13. Once you have developed and optimised this type of detection system, the next step is to assess its performance on large test series, which could include all the images we use VLU on. However, that requires measuring its predictive accuracy, something that isn’t happening with VLU. At present, despite obvious benefits to those who use it, Apple doesn’t seem to get anything of use from Visual Look Up. Maybe it is a free lunch after all.