hoakley August 14, 2021 General, Language, Macs, Technology

How to compare two PDF documents

There are some fundamental tasks we need to do with most if not all documents. One of them is to compare two versions of what are essentially the same document. These might be legal agreements, or revisions of a report, which are quite likely now to come in PDF format. This article explores how you can compare the contents of two PDF files, or perhaps why you can’t.

Comparing PDFs isn’t a feature you’re likely to find in apps which otherwise have rich support for the document format. It’s more likely that they’ll offer some form of redaction but not the ability to make any comparison between two documents. Try Adobe Acrobat Reader, and the tool will be offered, but the only way to obtain it is to upgrade to the full Adobe Acrobat DC, on a monthly subscription. That’s an offer that most will wisely refuse.

Compare text

A free solution is to export each of the documents in the form of text, and use a powerful text editor like BBEdit to compare those text documents. If you have Apple’s free Xcode installed, you could use its FileMerge app, which is hidden away inside the app bundle and accessed through the Open Developer Tool command in the Xcode menu, but I prefer BBEdit’s Find Differences… command in its Search menu.
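Once both documents have been exported as plain text, the comparison itself can also be scripted. As a minimal sketch in Python, assuming the exported text has already been read into two strings (the function name `diff_text` and the file labels are illustrative, not part of any tool mentioned here), the standard library’s `difflib` can list the differing lines:

```python
import difflib


def diff_text(old_text: str, new_text: str,
              old_name: str = "old.txt", new_name: str = "new.txt") -> list[str]:
    """Return unified-diff lines between two exported text versions."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # lineterm="" stops difflib adding newlines to the header lines.
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name,
                                     lineterm=""))


if __name__ == "__main__":
    old = "This agreement is made on 1 May.\nThe fee is 100 pounds.\n"
    new = "This agreement is made on 1 May.\nThe fee is 120 pounds.\n"
    for line in diff_text(old, new):
        print(line)
```

Lines starting with `-` appear only in the older text and lines starting with `+` only in the newer. An empty result means the two exports are identical line for line.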

You’ll then discover how variable the text exported from PDF files can be. One experiment worth trying is to make a copy of a text-rich PDF document and open and save it a few times using different apps, but without changing any of its content. This can move chunks of text around, even though when you view the PDF it clearly hasn’t changed at all. So, although you should be able to find all the content, you’re likely to have plenty of false positives, where there are differences between exported text, but not in what you see in the documents themselves.
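One way to gauge how many of those differences are just reordering is to ignore order altogether. The following Python sketch is an illustration rather than a feature of BBEdit or FileMerge, and `content_changes` is a hypothetical helper name. It normalises whitespace and compares the exported lines as unordered collections. It only helps when line breaks fall in the same places in both exports, and it can no longer tell you where in the document a change occurred.

```python
from collections import Counter


def normalise(line: str) -> str:
    """Collapse runs of whitespace so spacing differences are ignored."""
    return " ".join(line.split())


def content_changes(old_text: str, new_text: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) lines, ignoring the order in which lines appear."""
    old_counts = Counter(normalise(l) for l in old_text.splitlines() if l.strip())
    new_counts = Counter(normalise(l) for l in new_text.splitlines() if l.strip())
    # Counter subtraction keeps only lines that occur more often on the left side.
    removed = sorted((old_counts - new_counts).elements())
    added = sorted((new_counts - old_counts).elements())
    return removed, added


if __name__ == "__main__":
    old = "Header\nFirst paragraph.\nSecond paragraph.\n"
    new = "Second   paragraph.\nHeader\nFirst paragraph.\n"
    print(content_changes(old, new))  # same lines in a different order
```

If both lists come back empty, the two exports contain the same lines and differ only in their order or spacing.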

Paid-for Acrobat

As far as I can see, the only ‘serious’ feature which can compare PDF files is that in the paid-for version of Adobe Acrobat DC. Reaching for my copy, I put it through its paces and discovered that it too is of only limited use for this task. Apart from its standard Martian interface which is thankfully peculiar to Acrobat, small differences between PDFs often trigger hundreds of differences that are reported by Acrobat. If you’ve got all day to work through each page, it might be just the job, but if you want a clean and simple list of differences, you’re likely to be out of luck.

To test this, I took a text document with numbered lines, as is common with many legal documents, and printed it to PDF. I then made a handful of small changes to it, turned that into PDF, and compared the two results.

Because Acrobat has no sense of any underlying structure, where the minor changes in the text had caused renumbering of lines, Acrobat flagged every single line as being different. It also picked up all changes in page layout which didn’t involve any change in content: the removal of a single line on the first page of a document thus effectively made the rest of the document a long and tedious series of changes too.
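When the comparison is done on exported text rather than in Acrobat, the renumbering problem can be sidestepped by removing the line numbers first. This is a minimal Python sketch, assuming each exported line begins with its number followed by whitespace. The pattern and helper names are assumptions for illustration, and a line whose real content starts with a number would be mangled.

```python
import difflib
import re

# Assumed layout: optional spaces, the line number, optional "." or ")", then whitespace.
LINE_NUMBER = re.compile(r"^\s*\d+[.)]?\s+")


def strip_line_numbers(text: str) -> list[str]:
    """Remove a leading line number from every line of exported text."""
    return [LINE_NUMBER.sub("", line, count=1) for line in text.splitlines()]


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return only the lines that were removed ("- ") or added ("+ ")."""
    old_lines = strip_line_numbers(old_text)
    new_lines = strip_line_numbers(new_text)
    return [line for line in difflib.ndiff(old_lines, new_lines)
            if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    old = ("1 The parties agree.\n"
           "2 This clause is removed.\n"
           "3 Payment is due in 30 days.\n")
    new = ("1 The parties agree.\n"
           "2 Payment is due in 30 days.\n")
    for line in changed_lines(old, new):
        print(line)
```

With the numbers stripped, removing one clause is reported as a single deletion instead of a change to every following line.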

One strength, though, is that Acrobat is reliable at reporting when documents haven’t changed, even though text exported from them has changed in its structure. Beyond that, I didn’t find Acrobat much help, as it just overwhelmed me with irrelevant differences.

Room for improvement?

Given the popularity of PDF documents, you’d imagine there’s strong demand for something better for comparisons. However, any solution is doomed to fail unless it can overcome a fundamental design limitation of the PDF format: it doesn’t store content in any form of semantic structure, but merely what’s needed to make each page look right. You can alter that by manually flowing each block of text together, a procedure necessary for some types of PDF which need to be compatible with text readers, for instance, but hardly anyone bothers to do that, and it’s exceptional to discover documents which have been so structured.

Within a PDF file are as many as tens of thousands of objects, each of which contains the code to generate part of a page. If you were to set one word in a paragraph and style it using a different font and weight, the PDF engine may decide to split it out as another object to be placed on that page. But there’s no semantic link between those objects, and individual PDF writers can even place each word on a page independently, as a separate object. Working out how those words assemble into the text would then be a very difficult task even for “an AI”.
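To give a feel for that reassembly problem, here is a deliberately naive Python sketch. It assumes some PDF library has already supplied each text fragment with the coordinates of its origin. The `Fragment` type, the `reading_order` function and the tolerance value are all hypothetical, not taken from any real PDF API. It groups fragments whose baselines nearly coincide into lines, then orders each line from left to right.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text and where it is drawn (PDF y values grow up the page)."""
    x: float
    y: float
    text: str


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> list[str]:
    """Guess lines of text from independently placed fragments."""
    lines: list[tuple[float, list[Fragment]]] = []
    # Work from the top of the page down.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)       # close enough: same line
        else:
            lines.append((frag.y, [frag]))  # start a new line
    # Within each line, read from left to right.
    return [" ".join(f.text for f in sorted(frags, key=lambda f: f.x))
            for _, frags in lines]


if __name__ == "__main__":
    page = [
        Fragment(150.0, 700.0, "agreement"),
        Fragment(72.0, 686.0, "is"),
        Fragment(72.0, 700.4, "This"),
        Fragment(90.0, 686.0, "binding"),
    ]
    print(reading_order(page))  # ['This agreement', 'is binding']
```

A heuristic like this breaks down as soon as a page has two columns, headers and footers, superscripts or rotated text, because nothing in the fragments says which of them belong together.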

Not only that, but being such an old file format, it allows editors to tack objects on at the end of the file, to save having to write the whole file again. Sometimes a PDF engine will ‘flatten’ all those appended changes, which can completely restructure the objects.
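Appended revisions can often be spotted from the raw bytes, because each incremental save normally adds its own trailer ending in an `%%EOF` marker. The Python sketch below simply counts those markers, and its function names are illustrative. Treat the result as a rough hint only: linearised PDFs typically contain two markers even without edits, and the same bytes could in principle occur inside other data.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; more than one suggests appended revisions."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_updated(pdf_bytes: bytes) -> bool:
    """Heuristic only: True when the file has more than one %%EOF marker."""
    return count_eof_markers(pdf_bytes) > 1


if __name__ == "__main__":
    # Tiny synthetic byte strings standing in for real files.
    original = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
    revised = original + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
    print(looks_incrementally_updated(original))
    print(looks_incrementally_updated(revised))
```

For a real document, read the file in binary mode, for example with `open(path, "rb").read()`, and pass the bytes to the same function.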

The sorry truth is that the PDF format was never designed to provide access to its contents, except to display them correctly on the screen or in a page image for printing. Despite that, the whole world is busy storing millions of its most important documents every day as PDFs. Does that seem ever so slightly crazy?

I’m grateful to Paul for opening this Pandora’s box.

20 Comments

1

blackxacto on August 14, 2021 at 11:09 am

Is there a difference in display text and text displayed? Trying to understand what PDF considers components of a doc.

- 2
  
  hoakley on August 14, 2021 at 1:49 pm
  
  As I wrote, a PDF document consists of objects. Objects can be as small as a single character – as is often seen with drop caps – or as large as the whole text on one page. That depends entirely on the software that’s generating the PDF. For example, if you take a single-page document with a single text object and divide that up to move a paragraph within that text, you could end up with the text on that page taking three or more objects.
  All the PDF renderer does is render each of the objects on each page. It doesn’t know how they relate to one another, or whether their content is even connected.
  I hope that’s clearer.
  Howard.
  
  - 3
    
    blackxacto on August 14, 2021 at 3:33 pm
    
    “doesn’t know how they relate to one another, or whether their content is even connected.” Thank you.
    
4

DaveG on August 14, 2021 at 2:57 pm

Good article and certainly points out a weakness of our present document archival approach.

I wonder if pdf -> image -> ocr provides any value in relating two documents by doing a new pass at objectification. Clearly, it is likely to introduce some OCR errors, but maybe not a lot.

- 5
  
  hoakley on August 14, 2021 at 8:09 pm
  
  Thank you.
  No, OCR destructures documents even more, I’m afraid, and makes it impossible to distinguish things like headers and footers, another common feature of legal and similar documents.
  Howard.
  
6

John on August 14, 2021 at 4:07 pm

In the legal world, from my experience at least (20+ years as a lawyer working in and with other big international firms), everyone uses specialised document comparison software. Unfortunately, there are fewer than a handful of products on the market for this purpose (such as Litera Compare – https://www.litera.com/products/store/litera-compare/ ), they’re priced accordingly, and all of them are Windows only (it’s the main reason why I have VMware Fusion and Windows installed on my Mac at home).

Such software can compare PDFs, but we would only use this as a last resort because the results have all of the issues mentioned above. When sending revised documents to clients, other lawyers and others, people will without exception send either (i) a marked-up compare (in Word or PDF format) generated – always – from the old and new Word versions of the amended document, or (ii) a Word document with track changes. I think Word (on Windows at least) has a function to compare two documents, but it doesn’t have all of the features of the proper compare programs (e.g. the compare programs will identify text which has been moved from one place to another in a document and changes in a table, but I don’t think the built-in Word function can handle that).

- 7
  
  hoakley on August 14, 2021 at 8:10 pm
  
  Thank you. That’s invaluable experience.
  Howard.
  
8

btown on August 14, 2021 at 5:48 pm

https://draftable.com/compare is by far the best solution I’ve found for this, and it’s a shame it’s not more widely known about. It’s not open-source, and their offline app is Windows only, but its ability to handle multi-page relayouts is far and above Acrobat’s diff functionality, and there’s a free online version that’s reasonably secure so long as you don’t share the secret URL around.

- 9
  
  hoakley on August 14, 2021 at 8:22 pm
  
  Thank you. I’m afraid that I refuse to run up a VM to compare PDFs using software which is almost as expensive as Acrobat, and subscription-only.
  Howard.
  
10

jeffsyrop on August 14, 2021 at 7:37 pm

This is a really good article and I thank you for it. I’ve been struggling with this problem for years, both as a tech writer and an on-line political writer (on Quora). I love your point,

“Despite that, the whole world is busy storing millions of its most important documents every day as PDFs. Does that seem ever so slightly crazy?”

OMG! Is that ever true!!

And now let me share something that is not THE ANSWER, but has been INCREDIBLY useful to me:

1. Display in 2 separate windows, one on top of the other, 2 versions of a page — the original and an updated page on which a few changes have been made — and display both windows and both pages IN the windows identically, starting at the same line of text.

2. Press Alt-~ (tilde) to flip back and forth between the pages, and you can instantly see where the changes are. Even if changes have pushed text down the page so it no longer lines up with the text on the other page, you can simply scroll up or down in each document so that the next group of text in question is perfectly aligned. If, in the next paragraph, there was a doubled word, e.g. “the the”, then by Alt-Tabbing quickly back and forth you’ll see the word “the” jumping around and know exactly where the difference lies.

- 11
  
  hoakley on August 14, 2021 at 8:24 pm
  
  Thank you. That makes excellent sense, tragic though it is to admit that we’ve got to compare by eye.
  Howard.
  
12

rick@neverslow.com on August 15, 2021 at 4:00 pm

You should check out the (relatively) inexpensive BeyondCompare application from Scooter Software. It will directly compare PDF files as a native function. Runs on Mac or PC. I’ve used it for years and would never do without it.

- 13
  
  Wil van Antwerpen on August 15, 2021 at 5:16 pm
  
  Another vote for Beyond Compare. Besides macOS and Windows, it also runs on Linux.
  I have used it since 1999 for all my file compare needs.
  Mostly for comparing source code, but it also compares native formats such as PDF files.
  
- 14
  
  hoakley on August 15, 2021 at 10:13 pm
  
  Thank you. Having looked at its specs, nowhere can I see any mention of it being able to show differences in the content of PDFs. Indeed, they aren’t mentioned anywhere, nor are there screenshots of compared PDF content.
  Are you sure that you’re referring to PDF content, such as the text in any given paragraph in a document?
  Howard.
  
  - 15
    
    Wil van Antwerpen on August 16, 2021 at 9:14 am
    
    Hello Howard,
    They don’t really advertise that well.
    It is described here:
    https://scootersoftware.com/features.php?zz=kb_docxlspdf
    
    You can install a 30-day trial (their trial mechanism is interesting in that you really get 30 days to try: it only counts a day when you run it).
    
    - 16
      
      hoakley on August 16, 2021 at 7:06 pm
      
      Thank you.
      That’s simple text recovery – which you can get for free elsewhere, including in my own free Podofyllin. As I’ve explained, it isn’t of much use.
      Howard.
      
17

Rocky on August 16, 2021 at 5:10 am

PDF is far worse than most people realize. It’s based on PostScript, which at heart is a general-purpose programming language. PDF puts some limits on that programmability, but not nearly enough.

I spent too many years generating PostScript, PDF, and PPD files from both software I wrote and half-baked open-source projects, that sometimes became integral parts of very expensive commercial products. Quickly learned that the power of PostScript is also its downfall, and focused on a very limited subset just to retain my sanity. Dumped my ancient copies of Adobe’s language manuals just a couple of years ago. PostScript was unlike any other programming language I wrangled, and that’s saying something.

It’s a miracle that PS/PDF clones like Apple’s ever achieved “good enough” status, but Adobe’s outrageous pricing drove that market. Took many years to banish Adobe’s software and business model from my computers, but I’m glad I did. The rendering and other hiccups are rare now, but I pound my desk in frustration a couple times each year.

Comparing documents, except visually, is still virtually impossible, for all the reasons you described, and more.

PDF’s problems generated a gaggle of subsets, including PDF/A, PDF/X, PDF/UA, Tagged PDF, and others that mostly went nowhere except in very specialized industries.

- 18
  
  hoakley on August 16, 2021 at 7:02 pm
  
  Thank you. My feelings exactly.
  Howard.
  
19

2J on August 18, 2021 at 2:59 pm

just wanted to thank you for elucidating this subtle yet widespread problem. i’ve struggled with the many limitations of PDF for years.

on a separate but related note, Apple’s PDFKit is better than many other PDF implementations, but nevertheless has been somewhat inconsistent over the years. Notably, it had serious bugs up until 10.15.5, including one that wrote unnecessary data to PDF files resulting in inflated file sizes (aside: to my great annoyance, this was finally fixed in 10.15.5, but not announced in the meager accompanying release notes; I had to determine this manually/empirically).

back to the main issue, the fundamental/underlying weaknesses in the PDF format — is there any hope for reprieve? would more widespread adoption of tagged PDFs** solve (at least most of) the problem? or is a more fundamental redesign of PDF necessary?

** cf. https://en.wikipedia.org/wiki/PDF#Logical_structure_and_accessibility

- 20
  
  hoakley on August 18, 2021 at 6:51 pm
  
  Thank you.
  Yes, my free PDF app Podofyllin uses PDFKit throughout. It has been steadily improving from a nadir a few years ago when a lot was broken during a rewrite.
  PDF does the job it was designed for very well: it renders pages reproducibly. But it’s an ancient format which was never designed to give access to the continuous text in its contents. For that, you need a new format, not just a redesign.
  Howard.
  