hoakley February 1, 2024 Macs, Technology

PDF, Live Text and Spotlight: troubled relationships

Given its original purpose, age, and erratic development, we have unrealistic expectations of our PDF documents. Just because they have become something of a de facto standard for archiving important documents doesn’t mean that PDF is in the least bit suitable. Neither does it mean that Live Text can necessarily extract any of their contents, that they are indexable by Spotlight, or searchable. This article shows some examples that may make you reconsider PDF as an archival format.

In each of these screenshots, the PDF document is shown in my free PDF viewer Podofyllin. At the left are the page thumbnails, in the centre the document as it’s normally rendered and shown to the user. At the right is an area for the display of text content within that PDF.

The first of these demonstrates how PDF, Live Text and ultimately Spotlight can work together when they want to. This document is one of a series that I created to test OCR software, as it has an image of a text page that has been blurred in a controlled way, to model what can happen during printing and scanning. As there’s no text content shown at the right, this document consists of just the image.

Although it looks quite blurry, and there’s no OCR text to work with, Live Text recognises this perfectly. As a result, a document containing just the image can be searched, copied, and used every bit as well as if the PDF had been generated straight from software. In this case, if the document’s text were to be added to Spotlight’s indexes, it would become fully searchable using Spotlight.

Here is a PDF document containing the scanned images of the article I presented here yesterday, which is available from the Internet Archive and was published in print 30 years ago. Although no OCR has been performed, and there’s no text content at the right, Live Text does pretty well in identifying and recognising text within the page image. This generally works well with search, although words that are hyphenated to span lines won’t be found. About halfway down the right column, the word ‘clearly’ has been split into two, and that wouldn’t be found by searching the text recovered by Live Text.
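To illustrate why a split word defeats search, here is a minimal Python sketch of one possible post-processing step: rejoining words broken by a hyphen at a line end before searching. This is not something Live Text or Spotlight is known to do. The function name `join_hyphenated` and the sample string are hypothetical, and the sketch assumes that every line-end hyphen is a layout hyphen, which isn’t always true.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by
# its continuation at the start of the next line.
_LINE_END_HYPHEN = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def join_hyphenated(text: str) -> str:
    """Rejoin words split by a hyphen across a line break.

    Assumption: every line-end hyphen is a soft (layout) hyphen. Genuine
    compounds broken at their hyphen, such as 'well-' / 'known', will be
    fused incorrectly.
    """
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


# Hypothetical recovered text, with 'clearly' split across two lines.
recovered = "the word clear-\nly has been split"
print("clearly" in recovered)                   # False
print("clearly" in join_hyphenated(recovered))  # True
```

A real tool would need a dictionary or language model to decide which hyphens to keep, which is the kind of context that PDF itself doesn’t supply.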

The problem with this document is that Live Text isn’t as good as one of the better OCR products at assembling text within each page into its logical order. As shown in the highlighting here, try to copy the first few lines of the column at the left, and you end up with the interleaved contents of two columns. Unravelling them is tedious to say the least.
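The ordering step that better OCR products perform can be sketched roughly as follows. This is only an illustration under strong assumptions: each recognised line comes with the position of its bounding box, the page has equal-width columns, and no line spans more than one column. The `Line` type and `reading_order` function are hypothetical names, not part of any Apple API.

```python
from typing import NamedTuple


class Line(NamedTuple):
    x: float    # left edge of the recognised line
    y: float    # top edge, increasing down the page
    text: str


def reading_order(lines: list[Line], page_width: float, columns: int = 2) -> list[str]:
    """Sort lines column by column, then top to bottom within each column.

    Assumes equal-width columns and that no line spans more than one column.
    """
    col_width = page_width / columns

    def key(line: Line) -> tuple[int, float]:
        col = min(int(line.x // col_width), columns - 1)
        return (col, line.y)

    return [line.text for line in sorted(lines, key=key)]


# Lines as they might arrive when two columns are interleaved row by row.
interleaved = [
    Line(50, 100, "left 1"), Line(320, 100, "right 1"),
    Line(50, 120, "left 2"), Line(320, 120, "right 2"),
]
print(reading_order(interleaved, page_width=600))
# ['left 1', 'left 2', 'right 1', 'right 2']
```

Real pages are harder, as headings, figures and pull quotes often span columns, so layout has to be analysed page by page rather than assumed.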

This PDF is a real problem. Although I here show only a four-page excerpt of the original, you can view and download the whole of this from arXiv. As revealed in the text content at the right, almost nothing has been recovered from its PDF data. That isn’t because this document has been imported into PDF from images: it was actually generated straight to PDF from LaTeX via the hyperref package, but the characters used aren’t representable.

Although you can read the PDF perfectly clearly, Live Text doesn’t work on it, and you can neither select its content to copy it, nor search it for words. As there’s nothing for Spotlight to index here, either from the PDF mdimporter or Live Text, the whole contents of this document are beyond reach.

Even when used with OCR that runs Live Text recognition in ‘accurate’ mode, it presents serious problems. Half of it is written in Polish, for which Live Text needs additional linguistic information that would normally be downloaded on demand. Even then recognition falls apart when it hits one of the many equations and other mathematical content in this thesis.

This example generated by the Debenu Quick PDF Library is another parting of the ways between PDF, Live Text and Spotlight. What appears in the centre to be perfectly readable French turns out to be full of Unicode private codepoints, as shown at the right. Select some of the text in the central rendered document, copy it, and all you end up with is a long string of those infernal 􏰲 characters. It is neither searchable nor can it be indexed by the PDF mdimporter. Because of the unusual encoding scheme used in the PDF, this document too is inaccessible.
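One way to spot this kind of document before relying on it is to measure how much of its extracted text falls in Unicode’s private use areas. The following Python sketch does that for a string you have already extracted by some other means; `private_use_fraction` is a hypothetical helper, and the sample code points are chosen for illustration rather than taken from this PDF.

```python
import unicodedata


def private_use_fraction(text: str) -> float:
    """Return the share of non-whitespace characters that are private use.

    Private use code points have the Unicode general category 'Co'.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    private = sum(1 for c in chars if unicodedata.category(c) == "Co")
    return private / len(chars)


# Illustrative private use code points, not those in the French example.
garbled = "\U0010FC32\U0010FC33 \ue000"
print(private_use_fraction(garbled))    # 1.0
print(private_use_fraction("Bonjour"))  # 0.0
```

A high fraction suggests that the text will be neither searchable nor indexable, however readable the rendered page appears.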

There are many other examples of the problems faced by Live Text and Spotlight when dealing with PDF documents. Another common pattern results from pages that were laid out using high-end publishing software, which frequently scatters the letters in each word into separate data within the PDF, resulting in a jumble of letters and fragments of words. Fortunately, that is tending to decrease now, as greater emphasis is being placed on accessibility of PDF documents and in PDF/A standards, but it’s all too common in documents of the past.
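As a rough illustration of what reassembling scattered letters involves, this Python sketch merges text fragments lying on one baseline into words, using the horizontal gap between them. It assumes you already have each fragment’s start and end positions, and that a fixed gap threshold separates words; both are simplifications, and `merge_fragments` is a hypothetical name.

```python
def merge_fragments(fragments, gap=1.0):
    """Merge (x_start, x_end, text) fragments on one baseline into words.

    Fragments are joined when the space between them is no wider than
    `gap`; a wider space is treated as a word boundary. The threshold is
    an assumption and would need tuning to the font size in practice.
    """
    words = []
    current = ""
    last_end = None
    for x_start, x_end, text in sorted(fragments):
        if last_end is not None and x_start - last_end > gap:
            words.append(current)
            current = ""
        current += text
        last_end = x_end
    if current:
        words.append(current)
    return words


# Letters and word fragments stored as separate pieces, in arbitrary order.
pieces = [(20, 30, "fi"), (0, 5, "P"), (9, 13, "F"), (30, 36, "le"), (5, 9, "D")]
print(merge_fragments(pieces, gap=2.0))  # ['PDF', 'file']
```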

Compared to the challenges posed by PDF documents, recovering text from images is normally far more straightforward, even though it still appears to be magical. You can read more about some of the vicissitudes of PDF in the articles listed on this page.

21 Comments


1

rfog926695139 on February 1, 2024 at 9:24 am

I believe this ongoing issue stems from Apple losing touch with their legacy as pioneers in PDF technology. It seems that with each of the last four macOS updates, the native PDF handling has become increasingly riddled with bugs.

The frustration is palpable among DEVONthink forum members, who are weary of voicing their grievances. The glitches range from rendering mishaps to annotation errors that inexplicably inflate a PDF’s size from a mere 1KB to a whopping 100MB just by adding a highlighted line. My advice? It might be time to set aside macOS’s built-in PDF tools and switch to a third-party solution. I’ve personally set PDF Expert as my default viewer; it’s a bit pricey, but its performance justifies the cost.

(Text improved with AI, I’ve never written so well :-) )

- 2
  
  hoakley on February 1, 2024 at 12:33 pm
  
  Thank you.
  
  I’m afraid that I don’t agree: PDF is riddled with serious problems. Only with newer standards such as PDF/A do they start to resolve. Preview is the worst culprit in macOS, rather than PDF rendering or handling in Quartz and macOS. It’s not uncommon for my app Podofyllin to handle PDFs better than Preview, which is one of the reasons that I wrote it. (The main reason, though, is that Preview updates PDF documents whenever it opens them, something I don’t want a PDF viewer to do.) I think that PDF Expert still uses the Quartz PDF engine – there isn’t much choice there, as it’s the only one you don’t have to pay hefty licensing fees to use.
  
  Howard.
  
  - 3
    
    rfog926695139 on February 1, 2024 at 1:29 pm
    
    H, I cannot discuss with you because your macOS knowledge is one billion times over mine. I’ve downloaded the arXiv file, opened it with PDF Expert, and it is a chaotic (I think) multi-layer contraption. With PDF Expert you can flatten it, and then at least selection works as expected. File size after flattening is even the same. Xodo (under Windows) opens it and allows selection from it well, and Edge in Windows presents the same issues as Preview here in macOS.
    
    - 4
      
      hoakley on February 1, 2024 at 1:43 pm
      
      Of course you can discuss: that’s what these comments are for. Just as my knowledge might enlighten, so your experience and views are important contributions, and valued.
      PDF Expert sells as a PDF editor. Although I think it does rely on the Quartz (i.e. macOS built-in) PDF rendering engine, it does a lot more clever things, which is why I use it. I also use Adobe’s Acrobat Pro, which has even more goodies, although it’s designed for Martians!
      My problems with Preview are in the extras it tries to provide, which are almost invariably bug-ridden and affected by its nasty habits. Just as I wouldn’t use Preview to view or edit PDFs, I don’t use it for images either (except as a quick viewer), because it’s inadequate there too.
      Sadly, because Preview is bundled and free, many who use Macs try to use it with PDFs, which I think is most unwise. If you take PDFs seriously, then you really must dip your hand into your pocket and pay for something that’s up to the job. That’s a sad reflection on Apple’s neglect of Preview, which could be so much better, and there we’re on common ground.
      To me, the irony of this is that the PDF engine built into macOS, which is one of few other than Adobe’s, is so much more capable, but Preview doesn’t use it well, and gives the impression that PDF in macOS is broken. Yes, there are some oddities and have been some bugs, but the good folk at PDF Expert and other indie developers work round those, whereas Preview normally walks straight into them, hurting the user.
      Howard.
      
5

tempelmann@gmail.com on February 1, 2024 at 11:19 am

1. When I recently explored using Live Text recognition in FindAnyFile, I also found that words broken up at a line boundary cause problems with search. And that’s not even only a problem with Live Text but with Spotlight importers as well! When you check the successfully indexed text of a PDF with `mdimport -t -d3 /path/to/file | grep kMDItemTextContent`, you can find the same broken-up words in there. You’ll find that there’s an extra space or a LF (depending on macOS/importer version) in the index at the point where the word was broken. And you’ll see that Spotlight is then unable to find the complete word, too.

2. I am surprised that Live Text fails to read the last two PDFs at all. I thought it would simply OCR the text, like it does when it finds text in images. Why would that ever fail, especially if the text is clearly readable like in your examples? Do you have a guess?
In fact, when I screenshot a page and then open that in Preview, and change the selection mode (Tools menu) to Text Selection, I can indeed select and copy the text just fine. So I suppose that your PDF reader doesn’t use Live Text right in this case? I’d think that you’d render the page into an image and then let the text recognizer work on it. Perhaps you need to provide a switch that the user can set to choose whether to get the text from Spotlight’s importer or from Live Text.

- 6
  
  hoakley on February 1, 2024 at 12:27 pm
  
  Thank you, Thomas. I was hoping this would spark your interest. Have you downloaded that thesis from arXiv to test? I’m sorry, I can’t distribute the French example, but that too is apparently legal PDF.
  
  We have to remember that, from its inception, PDF has been designed to preserve layout rather than the flow of its content. When it breaks lines or words, then those are hard breaks; there’s nothing to stop a ‘smart’ recognition engine from trying to restore them, but that requires insight into the context, which PDF can’t provide. So any text extracted from PDF is going to suffer the same limitations.
  
  Formats that are content-dominant and more flexible in layout, like Rich Text, shouldn’t embed those word breaks in their content, but leave it up to the rendering engine to insert them where appropriate. Although I haven’t checked (and I’m not sure how good the macOS RTF mdimporter is now), I think you’ll find there’s no such problem with soft line breaks or word breaks in RTF, or in XML-based document formats either.
  
  What I think is happening with Live Text is that, in these cases, it isn’t working on the pixel map of the display, but looking inside the rendered image, in this case produced by the Quartz PDF renderer, which contains the code points being displayed. As those two documents have embedded code points that aren’t visible, we don’t see what Live Text does. It chokes on those invisible code points, and fails to extract usable text.
  
  Of course, if an app were to render the PDF into a Quartz PDFView and then render that as an image, as an additional step, then Live Text would only see what we do. But that’s not the way that apps render PDF – and the failure of Live Text in those two cases is the same whether you view them in Preview or Podofyllin. While I don’t know how Preview renders PDF, I do know how Podofyllin does, and that’s as a PDFView, which works fine with Live Text in other cases.
  
  As you’ll see in those two cases, the text available from the PDF file (as seen by the PDF mdimporter) is also useless, so having an option to use that doesn’t address the problem at all. The problem lies in the PDF format, which was never designed to deliver its content in any coherent way. This should be addressed by newer standards such as PDF/A, but of course a vast number of PDFs aren’t anywhere near PDF/A compliant. That’s generally worse with older documents, which of course are those we most want to archive, and search. You can, with a good PDF editor, eventually make them better-searchable, but that’s a significant amount of manual work for each affected file.
  
  Howard.
  
  - 7
    
    tempelmann@gmail.com on February 1, 2024 at 12:32 pm
    
    Howard, I was trying to suggest that you improve your own PDF viewer, that you pointed out in your article, to use Live Text by rendering the pages into images so that your viewer could search in PDFs where Spotlight fails. If you don’t do that, I hope someone else will make such a PDF viewer and advertise it for its special feature.
    
    - 8
      
      hoakley on February 1, 2024 at 12:40 pm
      
      Sorry, I don’t think there’s a market that would pay for that. Even full-featured PDF editors are in a marginal position, with the dominance of Adobe and their own rendering engine.
      Podofyllin is all about not changing the PDF files it views (which is one of Preview’s worst habits) and looking inside the structure and source code of the PDF.
      Howard.
      
    - 9
      
      hoakley on February 1, 2024 at 1:56 pm
      
      Thinking a bit more about this: I don’t think this is a good way to go.
      
      Normally, when you select the text in a rendered PDF file, what you’re selecting doesn’t go anywhere near Live Text, so is ‘recognised’ perfectly, as no recognition is involved. All that macOS does is take that text from the original that has been passed to Quartz to render in the view.
      
      If an app were to take that Quartz view, which is already in PDF, and to render it to a pixel map for Live Text, it doesn’t address any of the other issues such as line and word breaks, or text columns. All it does is create extra work in then converting that pixel map back into what it was in the first place, with the omission of code points that aren’t visible, and in doing so presents problems when scaling its pixel map image to the view.
      
      What would be much better, and more efficient, would be a filtering layer between the PDF to be rendered and Quartz, that removed all content that shouldn’t be sent for rendering, and that’s a really challenging job that might not even be possible in macOS. I think it’s a neat idea that wouldn’t implement, or if it were to, would require serious engineering resources. It’s the sort of thing that Apple could do, but being US-based doesn’t see any need, as these issues are most common with non-English languages.
      
      Howard.
      
10

Paul S. on February 1, 2024 at 2:00 pm

This PDF document is written in Polish. LiveText does not support Polish, as described here: https://www.apple.com/macos/feature-availability/#live-text

You seem to have edited the document and removed most of the Polish. That English page 3 is really page 29. If you save English page 29 as a TIFF file, then LiveText works perfectly.

- 11
  
  hoakley on February 1, 2024 at 2:07 pm
  
  Thank you.
  Strange as it might seem, many documents contain more than one language. That thesis isn’t just written in Polish, but in English too, around half of it. Last time I checked, Live Text supported English, even non-US English.
  The moment that you export the document in an image (non-PDF) format, of course it can be accessed by Live Text. So are you going to archive all your documents as volumes of TIFF images of their pages?
  That’s precisely the point – PDF, Live Text and Spotlight don’t get on well together in many documents.
  Howard.
  
12

Andrew Reilly on February 2, 2024 at 3:22 am

It’s a long time since I’ve made any PDF files from LaTeX, but back then I did notice that they were generally not searchable: the default even-right alignment meant that the lines were full of in-word spacing and kerning, to make it look good. Also, as you’ve mentioned, the encoding predates UTF and does a bunch of non-modern things. None of which mattered at the time, because looking the same on screen or page was all that it was about.

However search-ability is clearly important now, so I just had a look to see if the situation has improved at all. Apparently a little. There are modern (La)TeX variants that can use an underlying UTF encoding, and a macro package (cmap) that purports to make generated documents “searchable and copyable”: https://www.ctan.org/tex-archive/macros/latex/contrib/cmap No idea how well that works; I haven’t used it yet.

None of which guarantees that any particular PDF document found in the wild will be helpful. I’ve encountered contract documents that appear to have been processed to make them un-copyable by reversing the order of all of the glyphs on each page, in the file. Now, that’s just annoying!

- 13
  
  hoakley on February 2, 2024 at 8:06 am
  
  Thank you. I think this may result from the fact that the common route from LaTeX to PDF is via DVI, a format that’s even older than PDF.
  Howard.
  
14

John Gilbert on February 3, 2024 at 2:23 am

My experience of text recognition with the English-Polish arXiv document.

First, why do I want text recognition? 1) So that Spotlight can find files based on content, 2) So I can search in Preview or Skim or other PDF reader, 3) select text when viewing in a PDF reader. I am not concerned about Live Text in this context.

For this I use OCR products. I find a) Nitro PDF Pro (expensive) to do a good job, and b) OwlOCR (very much cheaper) does a somewhat better job. OwlOCR uses the macOS inbuilt text recognition engine. Both output a PDF with the OCR layer as well as the original. The OCR layer includes both the Polish and English text. Neither does very well with the diacritical marks in Polish words, just OCR to mostly non-diacritical text.

For both the resultant PDFs, 1) Spotlight very successfully finds the documents when using Finder search on both English and Polish words, 2) I can search for text in Preview and Skim, and 3) I can select and copy text in Preview and Skim (for this Preview is better than Skim).

So a simple OCR satisfies my three stated needs. In all my tests, output from OwlOCR was slightly more successful than that from Nitro.

It seems to me that it would be straightforward for Apple to add OCR to either Preview or to the Spotlight importer for PDFs. In the meantime, Nitro can be automated with AppleScript and OwlOCR provides Finder Quick Actions. I have also automated with Hazel, but (for my use) not worth the effort.

- 15
  
  hoakley on February 3, 2024 at 4:13 pm
  
  Thank you.
  So did you run your OCR tests on the old article of mine that’s laid out in columns? How did those apps fare with identifying the columns, separating the text correctly from them, and fusing word breaks?
  One Polish-English thesis does not a summer make.
  Howard.
  PS the OCR in macOS uses the Live Text engine, normally run in its second ‘accurate’ mode, which I don’t think is the mode it normally uses on images in apps like Preview, nor in image mdimporters.
  
  - 16
    
    John Gilbert on February 4, 2024 at 12:54 am
    
    OCR doesn’t do so well with multi-column images. I downloaded the six JP2 images. OwlOCR worked as expected. The results are good for searching (from Finder or within the document), but not for my 3rd requirement – selecting text.
    
    For simpler (wider spaced) multi-column pages from a recipe book, I have found OwlOCR-ed PDFs good for selecting sections of text.
    
    ps. Your conversation with tempelmann is fascinating and increases my understanding.
    
    - 17
      
      hoakley on February 4, 2024 at 1:00 pm
      
      Thank you. No doubt you’ll recall the older OCR apps that checked column and text layout page by page to address those problems. As far as I can tell, even in ‘accurate’ mode as used by OCR apps, Live Text doesn’t do that. Maybe that will come in the future.
      Howard.
      
    - 18
      
      hoakley on February 4, 2024 at 1:10 pm
      
      Thank you also for pointing out the error in my article’s title. I was up a bit late, but it gave me a chance to correct it, although bflost16 somehow does seem more appropriate!
      Howard.
      
19

Milo on February 5, 2024 at 12:51 pm

Now I’m curious. In what way does Preview.app change PDFs when they are opened only for reading?

- 20
  
  hoakley on February 5, 2024 at 12:58 pm
  
  I don’t know that it does any more, but every time Preview opened a PDF in a previous version of macOS, it changed the file modification date. Even now, when you open a PDF the Save command is enabled immediately, although no changes have been made to the document. I’m not prepared to put my PDFs at risk like that: that’s why I wrote Podofyllin, that can’t write to the original PDF, no matter what you do.
  Howard.
  
  - 21
    
    Milo on February 5, 2024 at 1:37 pm
    
    There seems to be some interaction between Preview and Spotlight, which results in the ctime being updated. I could not detect any changes to the actual file content though.
    
    There is one explanation here: https://apple.stackexchange.com/questions/192014/why-does-preview-change-the-ctime-of-a-pdf-and-how-can-i-disable-it
    
    I have to be honest, it does not bother me too much. But it’s good to know about this. Thank you.
    