PDF, Live Text and Spotlight: troubled relationships

Given the format’s original purpose, age, and erratic development, we have unrealistic expectations of our PDF documents. Just because PDF has become something of a de facto standard for archiving important documents doesn’t mean that it’s in the least bit suitable for that purpose. Neither does it mean that Live Text can necessarily extract any of their contents, or that they can be indexed by Spotlight or searched. This article shows some examples that may make you reconsider PDF as an archival format.

In each of these screenshots, the PDF document is shown in my free PDF viewer Podofyllin. At the left are the page thumbnails, in the centre the document as it’s normally rendered and shown to the user. At the right is an area for the display of text content within that PDF.

pdflivetext1

The first of these demonstrates how PDF, Live Text and ultimately Spotlight can work together when they want to. This document is one of a series that I created to test OCR software, as it has an image of a text page that has been blurred in a controlled way, to model what can happen during printing and scanning. As there’s no text content shown at the right, this document consists of just the image.

Although it looks quite blurry, and there’s no OCR text to work with, Live Text recognises its text perfectly. As a result, a document containing just the image can be searched, copied, and used every bit as well as if the PDF had been generated straight from software. In this case, if the document’s text were to be added to Spotlight’s indexes, it would become fully searchable using Spotlight.

pdflivetext2

Here is a PDF document containing scanned images of the article I presented here yesterday, which is available from the Internet Archive and was published in print 30 years ago. Although no OCR has been performed, and there’s no text content at the right, Live Text does pretty well in identifying and recognising text within the page image. This generally works well with search, although words that are hyphenated to span lines won’t be found. About halfway down the right column, the word ‘clearly’ has been split in two, and so wouldn’t be found by searching the text recovered by Live Text.
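One way to work around split words is to rejoin them before searching. This is only a minimal sketch in Python, not how Live Text or Spotlight handle recognised text: the function name and the sample lines are hypothetical, and it assumes you already have the recognised text as a list of lines.

```python
def join_hyphenated_lines(lines):
    """Join recognised lines into one string, merging words split by a line-end hyphen.

    Hypothetical helper for illustration: it assumes every trailing hyphen
    marks a word broken across lines, which is wrong for genuine compounds.
    """
    pieces = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if pieces and pieces[-1].endswith("-"):
            # The previous line ended mid-word: drop its hyphen and glue this line on.
            pieces[-1] = pieces[-1][:-1] + line
        else:
            pieces.append(line)
    return " ".join(pieces)


# Made-up sample lines standing in for text recovered from a page image.
recognised = ["the word has clear-", "ly been split"]
text = join_hyphenated_lines(recognised)
print("clearly" in text)
```

The obvious weakness is that a genuine hyphenated compound that happens to break at the end of a line loses its hyphen, so a real tool would need a dictionary check or similar before joining.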

The problem with this document is that Live Text isn’t as good as the better OCR products at assembling the text within each page into its logical order. As shown in the highlighting here, if you try to copy the first few lines of the column at the left, you end up with the interleaved contents of two columns. Unravelling them is tedious to say the least.
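To illustrate what assembling text into logical order involves, here is a rough sketch in Python that sorts recognised lines by column before sorting them down the page. It isn’t what Live Text or any OCR product does. The function, its parameters and the coordinates are all assumptions for illustration: each line is given as (x, y, text) with its origin at the top left and y increasing down the page, and the columns are assumed to be of equal width.

```python
def reading_order(boxes, page_width, columns=2):
    """Return the text of each box, column by column, top to bottom.

    boxes: list of (x, y, text) tuples, where x and y are the top-left corner
    of a recognised line and y increases down the page (an assumption;
    frameworks differ in their coordinate systems).
    """
    column_width = page_width / columns

    def sort_key(box):
        x, y, _ = box
        # Assign the line to a column by its left edge, clamped to the last column.
        column = min(int(x // column_width), columns - 1)
        return (column, y)

    return [text for _, _, text in sorted(boxes, key=sort_key)]


# Made-up boxes for a two-column page 600 points wide, listed row by row
# in the interleaved order that plain top-to-bottom sorting would give.
boxes = [
    (50, 100, "left 1"), (320, 100, "right 1"),
    (50, 120, "left 2"), (320, 120, "right 2"),
]
print(reading_order(boxes, page_width=600))
```

Real pages are seldom this tidy: headings that span both columns, figures and unequal column widths all defeat a fixed split, which is part of why layout analysis is hard to do well.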

pdflivetext3

This PDF is a real problem. Although I show only a four-page excerpt of the original here, you can view and download the whole of it from arXiv. As revealed in the text content at the right, almost nothing has been recovered from its PDF data. That isn’t because this document has been imported into PDF from images. It was actually generated straight to PDF from LaTeX via the hyperref package, but the characters used aren’t representable.

Although you can read the PDF perfectly clearly, Live Text doesn’t work on it, and you can neither select its content to copy it, nor search it for words. As there’s nothing for Spotlight to index here, either from the PDF mdimporter or Live Text, the whole contents of this document are beyond reach.

Even when it’s put through OCR that runs Live Text recognition in ‘accurate’ mode, this document presents serious problems. Half of it is written in Polish, for which Live Text needs additional linguistic information that would normally be downloaded on demand. Even then, recognition falls apart when it hits one of the many equations or other mathematical content in this thesis.

pdflivetext4

This example, generated by the Debenu Quick PDF Library, is another parting of the ways between PDF, Live Text and Spotlight. What appears in the centre to be perfectly readable French turns out to be full of Unicode private-use codepoints, as shown at the right. Select some of the text in the central rendered document, copy it, and all you end up with is a long string of those infernal 􏰲 characters. It is neither searchable nor can it be indexed by the PDF mdimporter. Because of the unusual encoding scheme used in the PDF, this document too is inaccessible.
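Text like this can at least be detected once it has been extracted. The following is a minimal sketch in Python, with a hypothetical function name, that measures how much of a string consists of private-use characters, which Unicode assigns to the general category Co. The sample strings are made up rather than taken from the document shown.

```python
import unicodedata


def private_use_fraction(text):
    """Return the fraction of non-whitespace characters that are private-use (category Co)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    private = sum(1 for ch in visible if unicodedata.category(ch) == "Co")
    return private / len(visible)


# Ordinary French text contains no private-use characters.
print(private_use_fraction("Bonjour tout le monde"))

# Three characters from Supplementary Private Use Area-B, written as escapes.
print(private_use_fraction("\U0010FC32\U0010FC33 \U0010FC34"))
```

A fraction close to 1 suggests that the extracted text is unusable for search or indexing, however readable the rendered page appears.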

There are many other examples of the problems faced by Live Text and Spotlight when dealing with PDF documents. Another common pattern results from pages that were laid out using high-end publishing software, which frequently scatters the letters in each word into separate data within the PDF, resulting in a jumble of letters and fragments of words. Fortunately, that is becoming less common now, as greater emphasis is being placed on the accessibility of PDF documents and on PDF/A standards, but it’s all too common in documents of the past.

Compared to the challenges posed by PDF documents, recovering text from images is normally far more straightforward, even though it still appears to be magical. You can read more about some of the vicissitudes of PDF in the articles listed on this page.