Last Week on My Mac: Spotlight on semantics

You may have noticed one phrase that was repeated throughout much of WWDC earlier this month: semantic search. Although it had appeared occasionally in the past, this year it came up in more than a dozen presentations, starting in the Platforms State of the Union on day 1. Just what is changing in Spotlight that is semantic?

In traditional search of text content, Spotlight looks in its content indexes for each file containing the search term you have provided. When you search for the term cow, it should return only those files containing those exact characters. In practice this is a bit more complex, as we normally want search to be case-insensitive, and there are other rules we might want to apply, such as whether it should return words like cower, where the term is a prefix, or the place-name Cowleaze, where it's also capitalised. Those are normally determined by a set of language-specific rules for the Unicode collation applied.
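Those matching rules can be sketched in a few lines of Python. This is only an illustration of the choices involved, not how Spotlight does it: the function name matches and its two switches are my own invention, and it relies on simple regular-expression case folding rather than proper language-specific Unicode collation.

```python
import re

def matches(term: str, text: str, whole_word: bool = True,
            ignore_case: bool = True) -> bool:
    """Return True if text contains term under the chosen matching rules.

    Hypothetical helper for illustration only; real search engines use
    tokenised indexes and Unicode collation rather than regex scans.
    """
    pattern = re.escape(term)
    if whole_word:
        # Require word boundaries either side, so "cow" won't match "cower".
        pattern = r"\b" + pattern + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, text, flags) is not None

samples = [
    "The cow grazed on the downs.",
    "They cower in the corner.",
    "We walked to Cowleaze.",
]

for text in samples:
    strict = matches("cow", text)                    # whole word, any case
    prefix = matches("cow", text, whole_word=False)  # prefixes allowed too
    print(f"{text!r}: whole word={strict}, prefix allowed={prefix}")
```

With whole-word matching only the first sentence is a hit. Allow prefixes and all three match, and whether Cowleaze counts then depends on case-sensitivity.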

Where there are many hits, as occurs when searching the internet, search ranking can be used to order the results, putting first those websites that contain the term and are the most frequently visited, or applying a more complex ranking algorithm. But that is of limited use when searching local files.

Semantic search is different, in that its matches aren't as crisp and Boolean. Rather than working like a simple index, it's more like a thesaurus in effect. This associates the word cow with a meaning, such as a mature female ox of the species Bos taurus, then looks up related concepts. Some will be close matches, like cattle, bovid, or ungulate; others will be more loosely related, like heifer, an immature cow.

Semantics is heavily dependent on context. If you're a farmer, you won't be interested in the females of other species also known as cows, such as elephants and rhinos, which a zoologist would want to include. A more general audience might also want its slang use for a disagreeable woman. There are also regional variations: in US English, cow commonly refers to both sexes and all ages of oxen, while in Australian and New Zealand English it can extend to almost anything that's deemed objectionable.

In the days before AI, this type of search was often referred to as fuzzy, compared to the crisp black-and-white of regular search, as it returns not only hits that contain the specified term, but also those in a grey zone of related terms.

One way to envisage this is to represent concepts, encapsulated as tokens, in multi-dimensional space. Each concept can be located by its coordinates, and by calculating the distance between any two concepts you can express how closely related they are. Semantic search thus tries to discover files and other items of similar and related concepts.
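As a minimal sketch of that idea, here are five concepts given coordinates in just three dimensions, compared using cosine similarity, one common way to measure closeness in which higher values mean more closely related. The coordinates are invented by hand purely for illustration. Real systems learn vectors with hundreds of dimensions from large amounts of text, and I'm not suggesting this is how Spotlight calculates anything.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means closely related."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented coordinates: (animal-ness, farm-ness, sky-ness).
# These numbers are made up for this example, not taken from any real model.
EMBEDDINGS = {
    "cow":      (0.9, 0.8, 0.0),
    "cattle":   (0.9, 0.9, 0.0),
    "heifer":   (0.8, 0.9, 0.1),
    "elephant": (0.9, 0.1, 0.0),
    "cloud":    (0.0, 0.1, 0.9),
}

def nearest(term, k=3):
    """Return the k concepts most similar to term, most similar first."""
    query = EMBEDDINGS[term]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in EMBEDDINGS.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

for concept, score in nearest("cow", k=4):
    print(f"{concept:10s} {score:.3f}")
```

With these made-up coordinates, cattle and heifer come out closest to cow, elephant is further away, and cloud is barely related at all.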

Earlier search methods did this using explicit lists of terms. For example, the photo below shows a few Belted Galloway cattle grazing in a field on chalk downs near here.

Traditionally, if I were maintaining my own image library I’d have to enter detailed information about that image to be stored in Exif metadata, a time-consuming task that’s also prone to error. I could get the location or breed wrong, but we now have the benefit of GPS to ensure at least the location is accurate.

More recently we've been able to get images analysed automatically, and for this photo that returned a set of keywords to identify the contents:
{animal, cow, mammal, ungulates, outdoor, grass, land, sky, cloudy, "blue sky", plant, shrub}
If we then search for images with the keyword of cow, that image should appear in the results, but its keywords omit semantically similar words such as cattle or oxen, so a search for either of those would miss it.
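That gap is easy to demonstrate using the keywords above. In this sketch, the RELATED table is a tiny hand-written stand-in for whatever a semantic search would use to relate terms, and both function names are hypothetical.

```python
# Keywords returned by automatic analysis of the photo, as listed above.
IMAGE_KEYWORDS = {
    "animal", "cow", "mammal", "ungulates", "outdoor", "grass",
    "land", "sky", "cloudy", "blue sky", "plant", "shrub",
}

# Hypothetical hand-written table of related terms, for illustration only.
RELATED = {
    "cattle": {"cow", "oxen", "heifer"},
    "oxen": {"cow", "cattle"},
}

def keyword_match(term, keywords):
    """Conventional search: the term must be one of the stored keywords."""
    return term in keywords

def expanded_match(term, keywords, related):
    """Broadened search: the term or any of its related terms may match."""
    candidates = {term} | related.get(term, set())
    return not candidates.isdisjoint(keywords)

for term in ("cow", "cattle", "oxen"):
    print(term,
          keyword_match(term, IMAGE_KEYWORDS),
          expanded_match(term, IMAGE_KEYWORDS, RELATED))
```

Conventional matching finds the image only for cow, while the broadened version also finds it for cattle and oxen. Maintaining such a table by hand is just another form of exhaustive keyword list, though.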

Rather than compiling more exhaustive sets of keywords, semantic search can broaden the scope to cope better. And because we can interact through Siri, we can fine-tune our search results, perhaps by specifying that the cattle should be black and white, and by combining that with conventional search criteria such as location.

There are some limitations in getting this to work effectively. Because semantics are so contextual and variable, it relies on apps and Core Spotlight. That's a big benefit to user privacy, as Core Spotlight's indexes are separated by user and stored locally, in places like ~/Library/Metadata rather than in the volume-based Spotlight indexes in the existing hidden .Spotlight-V100 folders. And unlike global Spotlight indexing and search, it requires apps to have code to support both indexing and search, as it can't just happen by magic.

While I’m sure we’ll all be impressed with many of the results of semantic search, hits that we never expected to find, it’s going to prove harder to assess those that it misses. That’s the more concerning aspect of the performance of all search systems, and in many cases how we will judge their value. Even if you aren’t impressed yet by other advanced AI coming in Golden Gate, semantic search could prove decisive.