PDF without Adobe: 2 Why PDF is so odd

Having explained a little of the history of PDF, this article goes on to show you what PDF files look like, and why PDF software is so odd.

Modern document formats on the Mac tend to be based on sophisticated industry standards such as those for images, which change quite rapidly, and those which use XML to package content. PDF is a real odd-ball by comparison, as its origins are in the 1980s, and it hasn’t changed much since the start of the century.

PostScript files, with the extension .ps, start with a prologue containing metadata such as
%!PS-Adobe-3.0
%%Title: c:\output\online.dvi
%%Creator: DVIPSONE 0.8 1991 Nov 30 16:22:12 SN 102
%%CreationDate: 1992 Mar 26 10:04:36

They then largely consist of dictionaries of PostScript instructions which are to be used to construct the page being described, such as
%%Page: 3 4
dvidict begin
bp % [3]
38811402 d U -34996224 d u -1582039 d U 29614244 r f2(3)s O o 34996224 d u -34340864 d u 8708260 r(of)s 185088 W(abstractions,)s 191757 X(such)S(as)S(\\mixed)S(blessing")S(and)S(\\retaliation,")T(in)S (semantic)S(nets)S(that)s o
and so on. These place each item of text and graphics on that page.

PDF generated by the current version of Safari in macOS Mojave starts by declaring the version used, which may come as a shock
%PDF-1.3
Yes, that’s PDF version 1.3, which was introduced in Adobe Acrobat 4.0 and defined in 1999. The current version is 1.7 (Adobe’s last version, adopted as ISO 32000-1 in 2008) or 2.0 (ISO 32000-2, published in 2017).
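
If you want to check which version a PDF file declares, a short Python sketch like this should do. The function name is my own invention, and looking only in the first 1024 bytes is an assumption based on how lenient readers are commonly described as behaving, not a rule taken from this article.

```python
import re

def pdf_header_version(data: bytes):
    """Return (major, minor) from a '%PDF-M.N' header, or None if absent."""
    # Assumption: only the first 1024 bytes are searched for the header.
    match = re.search(rb"%PDF-(\d)\.(\d+)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# A made-up start of file resembling the Safari output described above.
sample = b"%PDF-1.3\n4 0 obj\n"
print(pdf_header_version(sample))
```

Bear in mind that this reports only what the header claims. From PDF 1.4 onwards a file may also carry a Version entry in its document catalog, which this sketch ignores.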

Then follows the main data, as a series of objects arranged in a flattened tree structure, starting like
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
with a binary stream of data, which is here compressed using the Flate method (Deflate, as used in zip and PNG, which generally compresses better than the older LZW filter), terminated by
endstream
endobj
which defines object number 4.
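
To see what such a stream holds, you can decompress it with Python’s zlib module. This is only a rough sketch: the function name is mine, and it takes the shortcut of scanning for the endstream keyword rather than resolving the Length entry (which, as above, is often an indirect reference to another object). A real parser would use the Length.

```python
import re
import zlib

def first_flate_stream(pdf_bytes: bytes) -> bytes:
    """Decompress the first stream found, assuming it uses /FlateDecode only."""
    # Data sits between 'stream' plus a line ending and the 'endstream' keyword.
    match = re.search(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL)
    if match is None:
        raise ValueError("no stream found")
    # decompressobj stops at the end of the zlib data, so the line ending
    # that precedes 'endstream' is left behind in unused_data.
    return zlib.decompressobj().decompress(match.group(1))

# Build a tiny stand-in object 4, as no real file is to hand.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)
sample = (
    b"4 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed + b"\nendstream\nendobj\n"
)
print(first_flate_stream(sample))
```

Streams which use other filters, or a chain of filters, need more work than this.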

Some objects consist of code or definitions, such as
2 0 obj
<< /Type /Page /Parent 3 0 R /Resources 6 0 R /Contents 4 0 R /Annots 23 0 R >>
endobj
which is a Page dictionary.
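
Entries such as 3 0 R are indirect references: an object number, a generation number, and the letter R. The following sketch, with a function name of my own choosing, pulls those out of the Page dictionary above so you can see how the tree is stitched together. It is a regular expression, not a proper PDF parser, and will miss references inside arrays or nested dictionaries.

```python
import re

page_object = (
    b"2 0 obj\n"
    b"<< /Type /Page /Parent 3 0 R /Resources 6 0 R /Contents 4 0 R /Annots 23 0 R >>\n"
    b"endobj"
)

def indirect_refs(obj_bytes: bytes) -> dict:
    """Map each key whose value is 'N G R' to the object number N."""
    pattern = rb"/(\w+)\s+(\d+)\s+(\d+)\s+R\b"
    return {
        m.group(1).decode("ascii"): int(m.group(2))
        for m in re.finditer(pattern, obj_bytes)
    }

print(indirect_refs(page_object))
# {'Parent': 3, 'Resources': 6, 'Contents': 4, 'Annots': 23}
```

Here the Contents entry points at object 4, the compressed stream shown earlier.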

Somewhere towards the end of the file, you may find an object which contains metadata, such as the PDF engine which built the file:
199 0 obj
(macOS Version 10.14.3 \(Build 18D42\) Quartz PDFContext)
endobj
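
Note the backslashes before the inner parentheses: strings in PDF are wrapped in parentheses, so literal ones inside may be escaped. A minimal sketch for reading such a string follows. The function name is hypothetical, and it handles only the three escapes for parentheses and backslashes, not octal codes or unescaped nested parentheses.

```python
import re

def literal_string(obj_bytes: bytes) -> str:
    """Return the first parenthesised string, undoing simple escapes only."""
    # Body is any run of escaped characters, or characters that are
    # not a backslash or a parenthesis.
    match = re.search(rb"\((?:\\.|[^\\()])*\)", obj_bytes)
    if match is None:
        raise ValueError("no literal string found")
    body = match.group(0)[1:-1]
    # Drop the backslash in front of (, ) and backslash itself.
    body = re.sub(rb"\\([()\\])", rb"\1", body)
    return body.decode("latin-1")

sample = rb"199 0 obj (macOS Version 10.14.3 \(Build 18D42\) Quartz PDFContext) endobj"
print(literal_string(sample))
# macOS Version 10.14.3 (Build 18D42) Quartz PDFContext
```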

Right at the end of the PDF file is the cross reference, which starts like
xref
0 201
0000000000 65535 f
0000091870 00000 n
and ends with a trailer
trailer
<< /Size 201 /Root 143 0 R /Info 1 0 R /ID [ <981d8a68a004e1bc106cab57df0a065b> <981d8a68a004e1bc106cab57df0a065b> ] >>
startxref
91948
%%EOF
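
PDF readers normally start at the end: the number after startxref is the byte offset of the cross reference table, and each entry in that table gives the byte offset of one object, with n marking an object in use and f a free one. Here is a rough sketch of reading both. The function names are mine, and it assumes a classic xref table of this kind, not the compressed cross reference streams introduced in later versions.

```python
import re

def startxref_offset(pdf_bytes: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    tail = pdf_bytes[-1024:]  # the keyword is expected close to the end of file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found")
    return int(matches[-1])

def parse_xref_entry(line: bytes):
    """Split one classic xref entry into (offset, generation, in_use)."""
    offset, generation, kind = line.split()
    return int(offset), int(generation), kind == b"n"

tail = (
    b"trailer\n<< /Size 201 /Root 143 0 R /Info 1 0 R >>\n"
    b"startxref\n91948\n%%EOF\n"
)
print(startxref_offset(tail))                    # 91948
print(parse_xref_entry(b"0000091870 00000 n "))  # (91870, 0, True)
print(parse_xref_entry(b"0000000000 65535 f "))  # (0, 65535, False)
```

Because everything is addressed by byte offset, editing a PDF in a text editor and changing the length of anything throws those offsets out, unless the table is rebuilt.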

When a PDF file is changed by annotation, the contents of each annotation are added to the file as further objects.

Objects, as elements on the page, can be laid out in almost any order, which is what makes converting laid-out columns of text so infuriating. PDF can just drop in blocks of text and images in whatever order they come, which often doesn’t coincide with the original flow of the text. As a PDF file proceeds one page at a time, text in multiple columns running over several pages can be particularly disastrous to extract, or to try to reconstitute in any other way.
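
To illustrate the problem, this sketch takes four invented text blocks in the order they might appear in a file, and sorts them back into reading order. Everything here is hypothetical: the TextBlock type, the coordinates, and above all the column_split value, which a real extractor would have to guess from the page itself.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    x: float  # left edge in points
    y: float  # baseline in points; the PDF origin is bottom left, so larger is higher

# Invented blocks, in the order a content stream might happen to draw them.
blocks_in_file_order = [
    TextBlock("right column, line 1", 320, 700),
    TextBlock("left column, line 2", 72, 680),
    TextBlock("right column, line 2", 320, 680),
    TextBlock("left column, line 1", 72, 700),
]

def reading_order(blocks, column_split=300.0):
    """Sort blocks left column first, then top to bottom within each column."""
    return sorted(blocks, key=lambda b: (b.x >= column_split, -b.y))

for block in reading_order(blocks_in_file_order):
    print(block.text)
```

With two tidy columns that works. Add a heading spanning both columns, a pull quote, or a column which continues on the next page, and a simple sort like this falls apart.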

PDF files are extremely verbose, but their contents are now largely unreadable due to the extensive use of binary streams of data, and all the supporting information. A document containing a single character may thus result in a PDF file of 160 lines, making even expansive XML files look concise in comparison.

In the early days of PDF, much more document content was found in plain text, rather than compressed binary streams. It was then possible to patch defective or damaged PDF files using a text editor such as BBEdit, and that was something that I did on several occasions, to allow recovery of much of the PDF. Although it’s still possible to do this, the chances of success have fallen greatly. If you can’t open a PDF file using a good PDF editor now, it’s likely to be dead.

It’s also important to remember how old the roots of PDF are. The first volume of the Unicode standard 1.0 wasn’t published until 1991, and its introduction into Mac OS was long delayed after that. Consequently, PDF remains based on 8-bit extended ASCII text, with the main characters in a PDF file still being original 7-bit ASCII. Handling characters is generally accomplished by specifying individual characters in a specific font. This is why font substitution in PDF documents so commonly results in incorrect characters being displayed, with characters outside the extended ASCII set being most vulnerable. In the worst cases, it can render entire documents incomprehensible.
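
You can get a feel for the problem with an analogy in Python, which I offer as an illustration and not as a model of how a PDF engine works. The same byte means different characters under different 8-bit encodings, much as the same character code can select different glyphs when one font is substituted for another. The function name is mine, and Python’s mac_roman and cp1252 codecs are only stand-ins for a font’s own encoding.

```python
def show_text(codes: bytes, encoding: str) -> str:
    """Interpret 8-bit character codes using the named encoding."""
    return codes.decode(encoding)

codes = b"it\xd5s"  # byte 0xD5 lies outside 7-bit ASCII

print(show_text(codes, "mac_roman"))  # it’s
print(show_text(codes, "cp1252"))     # itÕs
```

The letters in plain 7-bit ASCII survive unchanged in both; it’s the character above 127 which goes wrong.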

Most peculiar of all, though, is the prolonged period of stability in the PDF standard. PDF rendering engines and apps have had nearly twenty years during which relatively little has changed. Stability in the format should have ensured that bugs are few and reliability very high, but it also means that there is little opportunity to add anything new. Without compelling new features, PDF products have stagnated, and getting users to pay to upgrade to a new version must be a futile task.

The original PDF standard has, though, spawned variants:

PDF/A is primarily intended for archival use, and restricts the use of some features which would make future retrieval more difficult. For example, fonts can’t be linked but must be embedded, and encryption is forbidden. This format is encountered quite commonly, and should be handled transparently by all good PDF software.
PDF/E is used in documenting construction, manufacturing and geospatial workflows, and is based on PDF 1.6 rather than 1.7/2.0, so again should be transparent to the user.
PDF/UA is a recent variant which requires content to be tagged in logical reading order, which is particularly valuable for those using assistive technologies. Again, this should be transparent to the user, but offers benefits to those having text content read to them, for example.
PDF/VT is another recent variant intended for variable and transactional printing, with support for ICC Output Intents.
PDF/X, which has numbered sub-variants such as PDF/X-4, has various restrictions imposed on regular PDF to maintain printing-related requirements. PDF/X-4 is most common, and supports colour managed, CMYK, grey, RGB, or spot colour data, and a BleedBox for printing with bleed.

8 Comments


1

Michele Galvagno on February 13, 2019 at 7:58 am

Thank you for this!
I use PDFs mainly for printing scores from Sibelius (.sib) files for composers, so I always wondered why each app had a different way of drawing lines and graphics if PDF is a universal format.

Sometimes lines are too thick, sometimes even the symbols move!!!
I really do not know how to export files for printing in another more reliable way, any thought?

- 2
  
  hoakley on February 13, 2019 at 8:32 am
  
  Thank you, Michele. I share your concerns.
  PDF remains the answer, but you need a reliable workflow using apps which produce the most consistent results.
  At least it’s better than HTML, where anything at all can happen, from an empty page to a jumbled mess!
  Howard.
  
  - 3
    
    Michele Galvagno on February 13, 2019 at 8:34 am
    
    Sibelius has its own PDF Exporter engine built in, but during the transition from the Qt 4 to the Qt 5 framework some things went wrong, and I now rely on using the macOS printer to export a PDF of what is seen on screen.
    
4

Victor Maurice Faubert on February 13, 2019 at 8:31 pm

Thank you very much for this explanation. I had no idea! It makes a lot of problems I’ve had with extracting text from PDFs much clearer.

5

Biff on February 15, 2019 at 1:22 pm

What a spectacularly well done series of posts! Thank you!

- 6
  
  hoakley on February 18, 2019 at 11:12 pm
  
  Thank you for your kind words.
  Howard.
  
7

Mike on February 28, 2019 at 10:01 pm

I just found this post from a search for “macOS Version 10.14.3 (Build 18D42) Quartz PDFContext”, as I’m having a new problem when exporting a PowerPoint presentation to PDF on OSX Mojave. I’m praying someone here can help.

A couple of weeks ago, when I exported a PPT to PDF it came out looking 100% like the presentation. But all of a sudden, when I now export, all of the shadows are dark black and look awful, compared to the subtle shadows in the presentation.

I noticed that the old PDFs that look good, show this encoding software in the info pane on OSX:
Mac OS X 10.13.5 Quartz PDFContext

The new PDFs made from the same PPT file that look terrible, show this:
macOS Version 10.14.3 (Build 18D42) Quartz PDFContext

I’m GUESSING this has to do with the bad quality export. Is there any way to fix this? I obviously cannot revert back to an older version of Mojave easily if that’s the issue. Can I change what encoding software OSX uses somehow?

My job relies on this, so any help would be GREATLY appreciated. Thank you!!

- 8
  
  hoakley on February 28, 2019 at 10:17 pm
  
  Your earlier PDF looks as if it was actually made in High Sierra, not Mojave. However, without having internal detail from it, it’s impossible to be sure.
  You can’t change that directly, I’m afraid. The only way you could generate a PDF with an older version of the PDF engine would be to do that on a Mac running an older version of macOS – according to those metadata, High Sierra 10.13.5 to be precise.
  You do have other options from a PowerPoint file. There are two different methods within the PowerPoint app – Export (as PDF), and Print, then select Print as PDF from the PDF popup menu at the lower left of the Print dialog. They may well produce quite different results.
  There is another option, which is to import your PowerPoint presentation into Keynote, which has two similar options: exporting to PDF, or printing to it.
  Within any of those PDF export or output options, you will sometimes have control over the settings to be used. From the sound of it, the conversion to PDF used something like low quality black and white rather than high quality colour. Watch very carefully during the export/print process to ensure that you’re using the highest quality options.
  I hope that one of those sorts your problems.
  Howard.
  
