From its inception, Spotlight was designed to encompass multiple types of search, among them searching the metadata and contents of local files, as detailed in Apple’s patent by Yan Arrouye and Keith Mortensen filed in 2000 (see References at the end), five years before Spotlight was released. Although documentation of local search is limited, a series of patents awarded to Apple provides deeper insights. At the heart of local search are hidden index and supporting files in each volume’s .Spotlight-V100 folder, which is served by the mds daemon and its helpers.
Indexing
When any file in a watched folder is created or saved, Spotlight re-indexes that file for the volume’s Spotlight indexes. The process runs like this:
- The file change is recorded in that volume’s FSEvents database, in the volume’s .fseventsd folder.
- FSEvents notifies Spotlight that a file has been created or changed, prompting Spotlight to re-index the content and metadata of the file(s) concerned.
- One of the multiple copies of the mdworker daemon checks the type (UTI) of the changed file, and locates the appropriate mdimporter plugin bundle for that type. In the case of Rich Text files, this is /System/Library/Spotlight/RichText.mdimporter, for example. Additional plugins can be found in /Library/Spotlight/ and sometimes ~/Library/Spotlight/, but most app-specific plugins are now stored in the Library/Spotlight folder inside that app.
- The mdworker uses the mdimporter code to generate indexed content for the changed file.
- That indexed content is then added to the Spotlight files in the volume’s .Spotlight-V100 folder, for use in future searches.
That series of steps is usually completed within a second or two of the file being created or edited, and both metadata and content are available to search shortly afterwards.
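To see which importer plugins are installed in the shared folders named above, a short Python sketch like the following may help. It only looks in those three folders, so plugins embedded inside app bundles are not found, and the function name find_importers is my own invention rather than anything provided by macOS.

```python
from pathlib import Path

# The three shared plugin folders named in the text.
PLUGIN_DIRS = [
    Path("/System/Library/Spotlight"),
    Path("/Library/Spotlight"),
    Path.home() / "Library" / "Spotlight",
]


def find_importers(folders=PLUGIN_DIRS):
    """Return paths of .mdimporter bundles found directly in each folder."""
    found = []
    for folder in folders:
        if not folder.is_dir():
            continue  # a folder may be absent, e.g. ~/Library/Spotlight
        found.extend(sorted(folder.glob("*.mdimporter")))
    return found


for bundle in find_importers():
    print(bundle)
```

On a system without those folders the loop simply prints nothing.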
Content extracted from each file that’s indexed by an mdworker process includes:
- file attributes, such as datestamps;
- extended attributes, stored in the file system metadata; these include keywords and copyright information, where provided;
- structured metadata from the main data in the file, as specified by that file type’s mdimporter plugin; examples include EXIF data;
- content, normally text, exported from the main data of the file, again using the mdimporter plugin.
Evidence from multiple patents shows that file metadata and content are indexed separately. Content appears to go through a conventional processing sequence:
- text content is divided into tokens at word boundaries, and the most frequent words, such as "the", may be eliminated as stop words;
- a stemmer may be used to derive word stems, and prefixes may also be separated;
- an indexer generates an inverted index.
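The first two of those steps can be illustrated with a toy Python sketch. This is not Spotlight's code: the stop word list, the suffix-stripping rules and the function names are all assumptions made for illustration, and a real stemmer is considerably more careful.

```python
import re

# Illustrative stop words only; Spotlight's actual list isn't documented.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def tokenise(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Toy suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text):
    """Tokenise, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenise(text) if t not in STOP_WORDS]


print(process("The lights of the city"))  # ['light', 'city']
```

The resulting terms are what an indexer would then enter into its dictionary.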
Tokenisation of file names uses rules for word boundaries laid down in the International Components for Unicode. In practice, word boundaries include a space, the underscore _, the hyphen - and changes of case used in CamelCase. At least in file names, Spotlight treats each of the following examples as three words:
one target two
one_target_two
one-target-two
OneTargetTwo
Languages other than English may allow other word boundaries, but those are the most common.
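Those four boundaries can be approximated with a couple of regular expressions in Python. This is only a rough imitation of the behaviour described above, not the ICU rules themselves, and split_name is a name I've made up for the sketch.

```python
import re


def split_name(name):
    """Split a file name at spaces, underscores, hyphens and case changes."""
    # Insert a space wherever a lower-case letter is followed by upper-case.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # Then split on runs of space, underscore or hyphen.
    return [word for word in re.split(r"[ _\-]+", spaced) if word]


for example in ("one target two", "one_target_two",
                "one-target-two", "OneTargetTwo"):
    print(split_name(example))
```

Each of the four examples yields a list of three words.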
Indexes
A volume’s hidden index folder contains the store itself, in a folder named Store-V2, and VolumeConfiguration.plist, a standard property list containing several dictionaries:
- Annotations, a large dictionary containing Creation_Predicates, another dictionary with extensive settings
- datestamps of creation and modification, with version numbers
- Exclusions, an array of excluded paths
- Options, a brief dictionary
- Stores, with the UUID of the store directory, datestamps, versions and other details.
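As this is a standard property list, its top-level structure can be inspected using Python's plistlib. The sketch below assumes you already have a readable copy of the file; the folder is normally protected, so reading it in place is likely to need elevated privileges. The function name summarise_plist is hypothetical, and the path in the usage comment is a placeholder.

```python
import plistlib


def summarise_plist(path):
    """Return {top-level key: type name of its value} for a property list."""
    with open(path, "rb") as handle:
        config = plistlib.load(handle)  # handles XML and binary plists
    return {key: type(value).__name__ for key, value in config.items()}


# Example, using a placeholder path to a copy of the file:
# print(summarise_plist("/path/to/copy/VolumeConfiguration.plist"))
```

This reports only the names and types of the top-level entries, and makes no assumptions about what each dictionary contains.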
The store directory is named using the UUID given in VolumeConfiguration.plist, and holds around 99 files and folders containing that volume’s store. Of those, store.db uses a proprietary format that has been reverse-engineered by Yogesh Khatri, who provides a parser for it. That file is relatively small, as the dictionaries, indexes and postings are contained in the many other files. The diagram below outlines this structure and lists some of the contents of the store.
Items shown in blue are directories, those in red are most likely to change soon after files are changed, and the ellipsis … after a name indicates there are multiple items with that as a prefix. Note that, while Core Spotlight has its own journal directory, other files don’t appear to separate its indexes.
Inverted indexes
Spotlight’s indexes are based on what is known by convention as the inverted index. At its most basic, this consists of a dictionary together with a postings list for each of the tokens in that dictionary. Each postings list references the locations of the occurrences of its token in the documents that have been indexed.
For example, suppose the token light has been obtained by tokenisation of a text file. For that token in the dictionary, there will be a postings list identifying where that token was found. There are different conventions as to how those postings lists work, and whether they include separate document identifiers.
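A minimal Python sketch of that basic structure follows. It adopts one of several possible conventions, in which each posting is a pair of document identifier and word position; that choice, and the sample documents, are mine rather than anything known about Spotlight's stores.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a postings list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, token in enumerate(docs[doc_id].lower().split()):
            index[token].append((doc_id, position))
    return dict(index)


docs = {
    1: "light from the window",
    2: "a light touch keeps the light on",
}
index = build_inverted_index(docs)
print(index["light"])  # [(1, 0), (2, 1), (2, 5)]
```

Searching for light is then a single dictionary lookup, rather than a scan through every document.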
Apple’s patents include several elaborations of basic inverted indexes. Hornkvist and others describe a two-level inverted indexing table with live index, together with an annotated postings list, with update sets and multiple index files with deltas. The two-level table keeps frequent tokens in a small table that is optimised for updates, and less common tokens in a larger table optimised for searching rather than updating.
Sachs and Sagotsky describe a collocation index constructed from an inverted index by determining distances between the occurrence of tokens in posting lists. Those that fall within a specified threshold are then added to the collocation index.
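The general idea of a collocation index can be sketched in Python as below. This is my own simplified reading of that description rather than the method claimed in the patent: it assumes postings are (document, position) pairs, measures distance in word positions, and compares every pair of tokens, which wouldn't scale to a real index.

```python
from itertools import combinations


def build_collocation_index(index, threshold):
    """Return {(token_a, token_b): [(doc_id, pos_a, pos_b), ...]} for pairs
    of tokens found within `threshold` positions in the same document."""
    collocations = {}
    for token_a, token_b in combinations(sorted(index), 2):
        hits = [
            (doc_a, pos_a, pos_b)
            for doc_a, pos_a in index[token_a]
            for doc_b, pos_b in index[token_b]
            if doc_a == doc_b and abs(pos_a - pos_b) <= threshold
        ]
        if hits:
            collocations[(token_a, token_b)] = hits
    return collocations


# A small hand-made inverted index: token -> [(doc_id, position), ...]
sample = {"inverted": [(1, 0)], "index": [(1, 1), (1, 7)], "table": [(1, 8)]}
print(build_collocation_index(sample, threshold=1))
```

With a threshold of 1, only adjacent tokens are recorded, so inverted and table never appear together.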
Changing indexes
Most inverted index systems are largely static, but Spotlight’s have to accommodate constant change as files are altered and saved, new files are created, and others are deleted. To enable main inverted indexes to remain well-structured and efficient, Spotlight stores appear to use separate transient posting tables to store recently acquired metadata and content. Periodically data from those is incorporated into more static tables. Similarly, when files are deleted their indexed metadata and contents aren’t removed immediately, but when the store next undergoes housekeeping.
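That arrangement can be modelled with a toy Python class, in which new postings go into a small transient table, deletions are merely noted, and both are folded into the static table at housekeeping. Everything here, including the class and method names, is an assumption for illustration, and the sketch doesn't model edits to files that are already indexed.

```python
class ChangingIndex:
    """Toy index with a small transient table merged into a static one."""

    def __init__(self):
        self.static = {}      # token -> set of doc ids, changed only at housekeeping
        self.transient = {}   # token -> set of doc ids, updated on each change
        self.deleted = set()  # doc ids awaiting removal at housekeeping

    def add(self, doc_id, tokens):
        for token in tokens:
            self.transient.setdefault(token, set()).add(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # postings stay in place for now

    def search(self, token):
        hits = self.static.get(token, set()) | self.transient.get(token, set())
        return sorted(hits - self.deleted)

    def housekeeping(self):
        # Fold transient postings into the static table.
        for token, ids in self.transient.items():
            self.static.setdefault(token, set()).update(ids)
        self.transient = {}
        # Only now remove postings for deleted documents.
        for token in list(self.static):
            self.static[token] -= self.deleted
            if not self.static[token]:
                del self.static[token]
        self.deleted = set()
```

Searches consult both tables and filter out deleted documents, so results stay correct between housekeeping runs even though the static table lags behind.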
Such housekeeping is likely to explain sustained periods of activity of mds and its helpers, for example in the minutes after startup. That is difficult to establish, though, as the activity isn’t accompanied by informative entries in the log.
Summary
- Each volume has its own hidden, top-level .Spotlight-V100 folder containing Spotlight indexes for the contents of that volume.
- When files change, mdworker processes extract metadata and contents for indexing, using the mdimporter plugin for that file type.
- Metadata and content appear to be indexed separately.
- Text content is tokenised and filtered using stop words and may be further processed for stems and prefixes.
- Inverted indexes are used, with entries in a dictionary having a postings list specifying the locations of each occurrence.
- More elaborate inverted indexes may be used, separating frequent tokens from those less common.
- Indexes are designed to cope with frequent changes, only incorporating those into more static tables during periodic housekeeping.
- Local Spotlight indexes and indexing are complicated and almost entirely undocumented.
References
US Patent 6,847,959 B1 Universal Interface for Retrieval of Information in a Computer System, Yan Arrouye and Keith Mortensen, filed 5 January 2000, dated 25 January 2005.
US Patent 7,698,328 B2 User-Directed Search Refinement, Matthew G Sachs and Jonathan A Sagotsky, filed 11 August 2006, dated 13 April 2010.
US Patent 7,783,589 B2 Inverted Index Processing, John M Hornkvist and others, filed 4 August 2006, dated 24 August 2010.
Stefan Büttcher, Charles LA Clarke and Gordon V Cormack (2010) Information Retrieval, Implementing and Evaluating Search Engines, MIT Press, ISBN 978 0 262 52887 0.
I’m very grateful to Yogesh Khatri for correcting me about the store.db database (see comments below).

