How to compare two PDF documents

There are some fundamental tasks we need to do with most if not all documents. One of them is to compare two versions of what are essentially the same document. These might be legal agreements, or revisions of a report, which are quite likely now to come in PDF format. This article explores how you can compare the contents of two PDF files, or perhaps why you can’t.

Comparing PDFs isn’t a feature you’re likely to find in apps which otherwise have rich support for the document format. It’s more likely that they’ll offer some form of redaction but not the ability to make any comparison between two documents. Try Adobe Acrobat Reader, and the tool will be offered, but the only way to obtain it is to upgrade to the full Adobe Acrobat DC, on a monthly subscription. That’s an offer that most will wisely refuse.

Compare text

A free solution is to export each of the documents in the form of text, and use a powerful text editor like BBEdit to compare those text documents. If you have Apple’s free Xcode SDK installed, you could use its FileMerge app, which is hidden away inside the app bundle and accessed through the Open Developer Tool command in the Xcode menu, but I prefer BBEdit’s Find Differences… command in its Search menu.

pdfdiff1

You’ll then discover how variable the text exported from PDF files can be. One experiment worth trying is to make a copy of a text-rich PDF document and open and save it a few times using different apps, but without changing any of its content. This can move chunks of text around, even though when you view the PDF it clearly hasn’t changed at all. So, although you should be able to find all the content, you’re likely to have plenty of false positives, where there are differences between exported text, but not in what you see in the documents themselves.

Paid-for Acrobat

As far as I can see, the only ‘serious’ feature which can compare PDF files is that in the paid-for version of Adobe Acrobat DC. Reaching for my copy, I put it through its paces and discovered that it too is of only limited use for this task. Apart from its standard Martian interface which is thankfully peculiar to Acrobat, small differences between PDFs often trigger hundreds of differences that are reported by Acrobat. If you’ve got all day to work through each page, it might be just the job, but if you want a clean and simple list of differences, you’re likely to be out of luck.

To test this, I took a text document with numbered lines, as is common with many legal documents, and printed it to PDF. I then made a handful of small changes to it, turned that into PDF, and compared the two results.

pdfdiff2

Because Acrobat has no sense of any underlying structure, where the minor changes in the text had caused renumbering of lines, Acrobat flagged every single line as being different. It also picked up all changes in page layout which didn’t involve any change in content: the removal of a single line on the first page of a document thus effectively made the rest of the document a long and tedious series of changes too.

One strength, though, is that Acrobat is reliable at reporting when documents haven’t changed, even though text exported from them has changed in its structure. Beyond that, I didn’t find Acrobat much help, as it just overwhelmed with irrelevant differences.

Room for improvement?

Given the popularity of PDF documents, you’d imagine there’s strong demand for something better for comparisons. However, any solution is doomed to fail unless it can overcome a fundamental design limitation of the PDF format: it doesn’t store content in any form of semantic structure, but merely what’s needed to make each page look right. You can alter that by manually flowing each block of text together, a procedure necessary for some types of PDF which need to be compatible with text readers, for instance, but hardly anyone bothers to do that, and it’s exceptional to discover documents which have been so structured.

Within a PDF file are as many as tens of thousands of objects, each of which contains the code to generate part of a page. If you were to set one word in a paragraph and style it using a different font and weight, the PDF engine may decide to split it out as another object to be placed on that page. But there’s no semantic link between those objects, and individual PDF writers can even place each word on a page independently, as a separate object. Working out how those words assemble into the text would then be a very difficult task even for “an AI”.

Not only that, but being such an old file format, it allows editors to tack objects on at the end of the file, to save having to write the whole file again. Sometimes a PDF engine will ‘flatten’ all those appended changes, which can completely restructure the objects.

The sorry truth is that the PDF format was never designed to provide access to its contents, except to display them correctly on the screen or in a page image for printing. Despite that, the whole world is busy storing millions of its most important documents every day as PDFs. Does that seem ever so slightly crazy?

I’m grateful to Paul for opening this Pandora’s box.