PDF without Adobe: 2 Why PDF is so odd

Having explained a little of the history of PDF, this article goes on to show you what PDF files look like, and why PDF software is so odd.

Modern document formats on the Mac tend to be based on sophisticated industry standards such as those for images, which change quite rapidly, and those which use XML to package content. PDF is a real odd-ball by comparison, as its origins are in the 1980s, and it hasn’t changed much since the start of the century.

PostScript files, with the extension .ps, start with a prologue containing metadata such as
%%Title: c:\output\online.dvi
%%Creator: DVIPSONE 0.8 1991 Nov 30 16:22:12 SN 102
%%CreationDate: 1992 Mar 26 10:04:36

They then largely consist of dictionaries of PostScript instructions which are to be used to construct the page being described, such as
%%Page: 3 4
dvidict begin bp % [3]
38811402 d U
-34996224 d u
-1582039 d U
29614244 r
f2(3)s O o
34996224 d u
-34340864 d u
8708260 r(of)s
185088 W(abstractions,)s
191757 X(such)S(as)S(\\mixed)S(blessing")S(and)S(\\retaliation,")T(in)S
(semantic)S(nets)S(that)s o

and so on. These place each item of text and graphics on that page.

PDF generated by the current version of Safari in macOS Mojave starts by declaring the version used, which may come as a shock
%PDF-1.3
Yes, that’s PDF version 1.3, which was introduced with Adobe Acrobat 4.0 in 1999. The current version is 1.7 (Adobe’s last version, which was adopted as ISO 32000-1 in 2008) or 2.0 (ISO 32000-2, published in 2017).
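
If you want to check which version a PDF declares without opening it in an app, the header can be read directly. This is only a minimal sketch: the function name pdf_version is my own, and it assumes the header sits at the very start of the data, which is usual but not guaranteed for every file in the wild.

```python
import re

def pdf_version(data: bytes):
    """Return the version declared in a PDF header, such as '1.3', or None."""
    # The header is a comment of the form %PDF-M.m at the start of the file.
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None

# Made-up sample: just a header followed by the start of an object.
sample = b"%PDF-1.3\n4 0 obj\n"
print(pdf_version(sample))  # prints 1.3
```

With a real file, you would pass in the first few bytes read in binary mode.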

Then follows the main data, as a series of objects arranged in a flattened tree structure, starting like
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>

with a binary stream of data, which is here compressed using the Flate method (the zlib/deflate compression which improved on the older LZW), terminated by
endstream
endobj
which defines object number 4.
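
To see what is inside such a stream, it has to be inflated first. The sketch below builds a tiny stream object in memory and then decompresses it with Python's zlib module. Both helper names are hypothetical, and it assumes that /Length is written directly as a number; in the Safari example above it is an indirect reference (5 0 R), which would have to be looked up in another object first.

```python
import re
import zlib

def build_stream_object(number: int, payload: bytes) -> bytes:
    """Build a minimal Flate-compressed stream object (demo helper, not a real API)."""
    compressed = zlib.compress(payload)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        number, len(compressed))
    return header + compressed + b"\nendstream\nendobj\n"

def inflate_stream(obj: bytes) -> bytes:
    """Decompress the data that follows the 'stream' keyword."""
    # Assumption: /Length is a direct integer, not an indirect reference.
    length = int(re.search(rb"/Length (\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj[start:start + length])

# Made-up page content: draw the word Hello in font F1 at 12 points.
obj = build_stream_object(4, b"BT /F1 12 Tf (Hello) Tj ET")
print(inflate_stream(obj))
```

What comes out is plain text again, a short run of page-drawing operators much like the PostScript shown earlier.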

Some objects consist of code or definitions, such as
2 0 obj
<< /Type /Page /Parent 3 0 R /Resources 6 0 R /Contents 4 0 R /Annots 23 0 R

which is a Page dictionary.
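
Entries such as 4 0 R are indirect references to other objects, by object number and generation, so the Page dictionary above points to object 4 for its contents. As a rough sketch, those references can be pulled out with a regular expression. The function name is my own, and a real parser would need to handle nested dictionaries and arrays, which this doesn't attempt.

```python
import re

def indirect_references(dictionary_text: str) -> dict:
    """Map each key to (object number, generation) where its value is 'N G R'."""
    pattern = r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b"
    return {key: (int(num), int(gen))
            for key, num, gen in re.findall(pattern, dictionary_text)}

page = "<< /Type /Page /Parent 3 0 R /Resources 6 0 R /Contents 4 0 R /Annots 23 0 R"
refs = indirect_references(page)
print(refs["Contents"])  # prints (4, 0)
```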

Somewhere towards the end of the file, you may find an object which contains metadata, such as the PDF engine which built the file:
199 0 obj
(macOS Version 10.14.3 \(Build 18D42\) Quartz PDFContext)
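
That is a PDF literal string: it is wrapped in parentheses, so any parentheses inside it are escaped with a backslash. A minimal sketch of undoing that follows. The function name is hypothetical, and it deliberately handles only escaped parentheses and backslashes, ignoring the other escapes (such as octal codes) which the format also allows.

```python
import re

def decode_literal_string(token: str) -> str:
    """Strip the outer parentheses and undo escaped parentheses and backslashes."""
    if not (token.startswith("(") and token.endswith(")")):
        raise ValueError("not a PDF literal string")
    # Replace backslash + ( ) or \ with the bare character.
    return re.sub(r"\\([()\\])", r"\1", token[1:-1])

raw = r"(macOS Version 10.14.3 \(Build 18D42\) Quartz PDFContext)"
print(decode_literal_string(raw))
# prints macOS Version 10.14.3 (Build 18D42) Quartz PDFContext
```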

Right at the end of the PDF file is the cross-reference table, which records the byte offset of each object in the file, and starts like
0 201
0000000000 65535 f
0000091870 00000 n

and ends with a trailer
<< /Size 201 /Root 143 0 R /Info 1 0 R /ID [ <981d8a68a004e1bc106cab57df0a065b>
<981d8a68a004e1bc106cab57df0a065b> ] >>
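
Each line in that table is a fixed-width record: a ten-digit byte offset, a five-digit generation number, and n for an object in use or f for a free entry. The first line gives the number of the first object and how many entries follow. Here is a small sketch of reading it, using only the two entries quoted above; the function name is my own, and it assumes a classic text table rather than the compressed cross-reference streams used by later PDF versions.

```python
def parse_xref_section(lines):
    """Return {object number: (byte offset, generation, in_use)} for one subsection."""
    first, count = (int(n) for n in lines[0].split())
    table = {}
    # Slicing copes with a sample that quotes fewer entries than the count.
    for i, line in enumerate(lines[1:1 + count]):
        offset, generation, kind = line.split()
        table[first + i] = (int(offset), int(generation), kind == "n")
    return table

sample = ["0 201", "0000000000 65535 f", "0000091870 00000 n"]
print(parse_xref_section(sample)[1])  # prints (91870, 0, True)
```

So object 1 starts 91,870 bytes into the file, which is how a reader can jump straight to it without scanning everything before it.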

When a PDF file is changed by annotation, the contents of each annotation are added to the file as further objects.

Objects, as elements on the page, can be stored in almost any order, which is what often makes converting laid-out columns of text so infuriating. PDF can just drop in blocks of text and images in whatever order they come, which often doesn’t coincide with the original flow of the text. As a PDF file proceeds one page at a time, multiple columns laid out over several pages can be particularly disastrous to extract as text, or to try to reconstitute in any other way.
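
A toy example shows why columns cause such trouble. The fragments below are entirely made up, as are the function names and the position of the gutter between columns. Sorting simply from top to bottom interleaves the two columns, and getting the right order requires knowing where the columns are, which the PDF itself doesn't record.

```python
# (x, y, text): hypothetical fragments from a two-column page, in the order
# they might appear in the file. PDF's y axis runs up the page.
fragments = [
    (300, 700, "Column two, line one."),
    (50, 700, "Column one, line one."),
    (300, 680, "Column two, line two."),
    (50, 680, "Column one, line two."),
]

def naive_order(frags):
    """Sort top to bottom, then left to right, ignoring columns."""
    return [text for x, y, text in sorted(frags, key=lambda f: (-f[1], f[0]))]

def column_order(frags, split_x=200):
    """Sort by column first (split_x is an assumed gutter), then top to bottom."""
    return [text for x, y, text in
            sorted(frags, key=lambda f: (f[0] >= split_x, -f[1]))]

print(naive_order(fragments))   # lines from the two columns alternate
print(column_order(fragments))  # column one in full, then column two
```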

PDF files are extremely verbose, but their contents are now largely unreadable due to the extensive use of binary streams of data, and all the supporting information. A document containing a single character may thus result in a PDF file of 160 lines, making even expansive XML files look concise in comparison.

In the early days of PDF, much more document content was found in plain text, rather than compressed binary streams. It was then possible to patch defective or damaged PDF files using a text editor such as BBEdit, and that was something that I did on several occasions, to allow recovery of much of the PDF. Although it’s still possible to do this, the chances of success have fallen greatly. If you can’t open a PDF file using a good PDF editor now, it’s likely to be dead.

It’s also important to remember how old the roots of PDF are. The first volume of the Unicode standard 1.0 wasn’t published until 1991, and its introduction into Mac OS was long delayed after that. Consequently, PDF remains based on 8-bit extended ASCII text, with the main characters in a PDF file still being original 7-bit ASCII. Text is generally handled by specifying individual character codes in a specific font. This is why font substitution in PDF documents so commonly results in incorrect characters being displayed, with characters outside the extended ASCII set being most vulnerable. In the worst cases, it can render entire documents incomprehensible.
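
The effect is easy to reproduce outside PDF. In this sketch the same bytes are interpreted using two of Python's built-in 8-bit codecs, mac_roman and cp1252, which I'm using as stand-ins for a font's encoding; they are close relatives of the encodings PDF names for Mac and Windows text, but the analogy is only approximate. Plain ASCII survives intact, while the one character above 127 changes completely.

```python
def render(codes: bytes, encoding: str) -> str:
    """Interpret 8-bit character codes using the given encoding."""
    return codes.decode(encoding)

codes = b"It\xd5s"  # byte 0xD5 lies outside 7-bit ASCII

print(render(codes, "mac_roman"))  # 0xD5 is a right single quotation mark
print(render(codes, "cp1252"))     # 0xD5 is a capital O with tilde
```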

Most peculiar of all, though, is the prolonged period of stability in the PDF standard. PDF rendering engines and apps have had nearly twenty years during which relatively little has changed. Stability in the format should have ensured that bugs are few and reliability very high, but it also means that there is little opportunity to add anything new. Without compelling new features, PDF products have stagnated, and getting users to pay to upgrade to a new version must be a futile task.

The original PDF standard has, though, spawned variants:

  • PDF/A is primarily intended for archival use, and restricts the use of some features which would make future retrieval more difficult. For example, fonts can’t be linked but must be embedded, and encryption is forbidden. This format is encountered quite commonly, and should be handled transparently by all good PDF software.
  • PDF/E is used in documenting construction, manufacturing and geospatial workflows, and is based on PDF 1.6 rather than 1.7/2.0, so again should be transparent to the user.
  • PDF/UA is a recent variant which requires content to be tagged in logical reading order, which is particularly valuable for those using assistive technologies. Again, this should be transparent to the user, but offers benefits to those having text content read to them, for example.
  • PDF/VT is another recent variant intended for variable and transactional printing, with support for ICC Output Intents.
  • PDF/X, which has numbered sub-variants such as PDF/X-4, has various restrictions imposed on regular PDF to maintain printing-related requirements. PDF/X-4 is most common, and supports colour managed, CMYK, grey, RGB, or spot colour data, and a BleedBox for printing with bleed.