hoakley April 22, 2020 Macs, Technology

File Integrity 7 : Which other file formats are resilient?

The advice I’ve always given about saving archive copies of important documents is to keep at least two copies, one in the normal finished format, and another in a common exchange format such as PDF. It’s also very useful if you can keep individual components of compound documents, such as separate text and illustrations. That advice is based largely on maintaining accessibility in the future: app-specific formats come and go, and that strategy should give you or your descendants the greatest chance of being able to access that document in years to come.

What I haven’t examined objectively is which formats to choose, where there is choice. In some areas such as audio and video, there’s a bewildering variety of options which are best left to the expert. This article looks at the resilience to corruption of some more general formats, using my tool Vandal.

In each case, I have assessed the effects of low levels of corruption in 1 MB test files, from 2 corrupted bytes up to 40.
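For readers who want to try something similar without Vandal, the following Python sketch overwrites a chosen number of randomly selected bytes in a copy of a file. It only illustrates the general idea and makes no claim to reproduce how Vandal works; the function name `corrupt_copy` and its parameters are assumptions made for this example.

```python
import random
from pathlib import Path
from typing import List, Optional


def corrupt_copy(source: str, dest: str, n_bytes: int,
                 seed: Optional[int] = None) -> List[int]:
    """Write a copy of `source` to `dest` with `n_bytes` random bytes changed.

    Returns the sorted offsets that were altered.
    Hypothetical helper for illustration, not Vandal's method.
    """
    data = bytearray(Path(source).read_bytes())
    if n_bytes > len(data):
        raise ValueError("cannot corrupt more bytes than the file contains")
    rng = random.Random(seed)
    # Pick distinct offsets so that exactly n_bytes positions are damaged.
    offsets = sorted(rng.sample(range(len(data)), n_bytes))
    for offset in offsets:
        # XOR with a non-zero value guarantees the byte really changes.
        data[offset] ^= rng.randrange(1, 256)
    Path(dest).write_bytes(bytes(data))
    return offsets
```

Calling it with a byte count of 2, then 4, and so on up to 40 on copies of the same 1 MB document would give a graded series of damaged files to try opening in each app.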

Text, CSV, etc.

The most basic common format that can be used for a great many documents is plain text, which is now Unicode UTF-8 rather than ASCII of course. Although it’s possible for corruption to generate text which isn’t valid UTF-8, good text editors such as BBEdit can cope with pretty well anything you throw at them. So whatever happens by way of corruption, you should always be able to open plain text files. This applies to text-based formats such as CSV, which can be used to contain data from spreadsheets and databases.
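As a rough illustration of why damaged plain text remains readable, this Python sketch decodes bytes as UTF-8 and substitutes the Unicode replacement character for any sequence that is no longer valid, rather than refusing to open the file. It shows one general approach, not how BBEdit or any other editor is implemented; `read_damaged_text` and the sample string are invented for this example.

```python
def read_damaged_text(raw: bytes) -> str:
    """Decode UTF-8 leniently: invalid sequences become U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")


original = "Résumé, naïve, 10 €".encode("utf-8")
damaged = bytearray(original)
damaged[1] = 0xFF  # 0xFF never occurs in valid UTF-8, so this byte is now invalid

text = read_damaged_text(bytes(damaged))
print(text)  # the damage is marked with U+FFFD, the rest stays readable
```

A strict decode of the same bytes would raise `UnicodeDecodeError`, which is the behaviour of tools that give up on the whole file.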

The one snag with text-based formats is that corruption can be hard to spot. In a spreadsheet saved as CSV, cell contents may be changed by corruption, and those changes may be impossible to detect. Wherever possible, add check totals for rows and columns before exporting to CSV, to help identify any errors introduced later.
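One way of using check totals is sketched below in Python. It assumes a layout chosen purely for this example: every row ends with the sum of its other cells, and the final row holds the column sums. The function name and that layout are assumptions, not any kind of standard.

```python
import csv
import io
from typing import List


def find_total_mismatches(csv_text: str) -> List[str]:
    """Report rows and columns whose stored check total no longer matches.

    Assumed layout: last cell of each row = row sum; last row = column sums.
    """
    rows = [[float(cell) for cell in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]
    *body, col_totals = rows
    problems = []
    for r, row in enumerate(body, start=1):
        if abs(sum(row[:-1]) - row[-1]) > 1e-9:
            problems.append(f"row {r}")
    for c in range(len(col_totals) - 1):
        if abs(sum(row[c] for row in body) - col_totals[c]) > 1e-9:
            problems.append(f"column {c + 1}")
    return problems
```

When a single cell has changed, one row and one column are reported, and their intersection points to the damaged cell. A cell corrupted into something that is no longer a number makes `float()` raise `ValueError`, which is itself a sign of damage.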

More structured text such as XML and JSON is also relatively robust, but has to be parsed in order to import and do anything useful with it. Maybe at some time in the not-too-distant future someone will come up with more intelligent tools which successfully repair damage in structured formats such as these, but at least you can do so given some knowledge, time and patience.
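Locating damage in structured text is at least straightforward, because parsers report where they gave up. A minimal Python sketch using the standard `json` module follows; the function name `locate_json_damage` is made up for this example.

```python
import json
from typing import Optional, Tuple


def locate_json_damage(text: str) -> Optional[Tuple[int, int]]:
    """Return (line, column) of the first syntax error, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno
    return None
```

This only catches damage which breaks the syntax. A corrupted digit inside a number, or a changed letter inside a string, still parses cleanly, so the snag described for CSV applies here too.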

Styled text – RTF

Rich Text Format is a 33-year-old text-based format with limited ability to contain images and other rich media. Its markup is terse and not intended to be edited by humans, but has simple syntax. Even with quite extensive corruption, up to 40 bytes across a 1 MB file, it could be opened, viewed and corrected using simple RTF editors like TextEdit and my own DelightEd. It thus appears as robust a format as plain text, particularly when packaged as RTFD.

In recent years, Apple has preferred to use file formats which consist of structured folders, or packages, pretending to be files. An early example (which some users hate) is its RTFD, which it designed to extend the rich media which can be embedded in Rich Text documents without violating compatibility. Styled text content is contained within a conventional RTF file, and media files are stored separately within the RTFD folder. This is particularly resilient to corruption, when compared with single file formats which embed rich media within styled text.

Archiving styled text without significant images or other content as RTF is therefore a good choice, as is RTFD where the document includes images and other non-textual content.

Microsoft Office

One of the hopes when Microsoft moved away from proprietary binary file formats to more open ones using XML was that their contents would become more accessible. To a certain extent they have, but this hasn’t been reflected in any improvement in their resilience. Even low levels of corruption can cause Microsoft’s apps (and Apple’s Pages and Numbers) to report unrecoverable errors, and abandon all attempts to recover the damaged document. Sometimes repair does work, though.

In testing here, Excel was able to open documents with 2 corrupted bytes in 1 MB, but no more. Word failed with 2 corrupted bytes, yet succeeded in opening one document with 4 corrupted bytes, and nothing beyond that. Trying Text Recovery in Word was a futile exercise which consistently resulted in useless gibberish.

This is very frustrating, and makes .xlsx and .docx formats poor choices in the event of corruption.

PDF

My expectations for the resilience of PDF documents were low. When the format was first released, most of its internal objects were still commonly stored in plain text form, and I became well used to editing them to fix broken PDFs for magazine readers and others. Since then, almost all PDF objects have become compressed using ‘flate’, and manual repair requires tools not currently available for macOS. I expected those compressed objects to be vulnerable to the effects of single byte corruption, and to readily make whole documents unusable. I was wrong.
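That worry about compressed objects is easy to reproduce in isolation. The Python sketch below compresses some data with zlib, the same deflate-based scheme that ‘flate’ streams use, changes a single byte, and tries to decompress it again. The content string is an invented stand-in for a page content stream, and `try_inflate` is a name chosen for this example. In a real PDF each stream is compressed separately, so damage like this would affect one object rather than necessarily the whole document.

```python
import zlib
from typing import Optional


def try_inflate(stream: bytes) -> Optional[bytes]:
    """Return the decompressed data, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(stream)
    except zlib.error:
        return None


content = b"BT /F1 12 Tf (Hello, archive) Tj ET\n" * 200  # stand-in page content
stream = zlib.compress(content)

damaged = bytearray(stream)
damaged[len(damaged) // 2] ^= 0xFF  # flip every bit of one byte mid-stream

print(try_inflate(stream) == content)          # intact stream round-trips
print(try_inflate(bytes(damaged)) == content)  # damaged stream does not
```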

Because of their widespread use in archives, I tested five very different PDF documents, each of just over 1 MB size, at five different levels of corruption, from 2 bytes up to 40. I tried opening them with Preview (macOS 10.15.4), which isn’t noted for its performance, and three dedicated PDF apps: Adobe Acrobat ‘Pro’ DC, PDFpenPro, and PDF Expert.

Only one of the five test documents failed to open reliably with 2 corrupt bytes. With 4 corrupt bytes, another started to lose content but could still be opened. At 20 bytes, one document was badly mutilated with much of its content lost, and one failed completely, and at 40 bytes all five of the documents had obvious problems.

One notable success was the resilience of PDF documents made from Keynote presentations, consisting of just a series of slides. Although some of the images became scrambled at higher levels of corruption, slideshows remained relatively unscathed.

The big problem with a damaged PDF is the lack of tools to attempt repair. For the cost of Acrobat ‘Pro’, it’s shocking that it has no real tools for repair of corrupt PDFs, and just complains of their damage.

Recovering content from damaged PDFs isn’t a good plan, but they do appear significantly more resilient than Microsoft Word’s .docx format, for example. Provided they are accompanied by separate text and graphics, they do appear a good choice for archiving documents. Perhaps we should also start encouraging our grandchildren to learn how to repair them too, as I’m sure there are going to be plenty of jobs for skilled repairers of PDF in the future.

Conclusions

Controlled corruption of a range of test documents in different formats didn’t bring any great surprises: basic formats using UTF-8 plain text remain good choices where they can be useful. RTF and RTFD should also be considered, as they appear to be relatively resilient. PDF isn’t as bad as I had feared, and is probably still the best choice for maintaining archive copies of formatted and laid out documents, provided that text, image and other content are also held separately.

22 Comments


1

Joss on April 22, 2020 at 7:00 am

From my own experience I can confirm all of that, especially MS Office documents (corruption due to save errors on an older Windows machine) and PDFs. I once downloaded a PDF, a large one (over 200 MB), and the original file had been corrupted at one point, because it was a torrent download, and some (very few) data chunks were missing right at the very end of the download. But the PDF was based on a book scan, with OCR and images in separate layers, and luckily it was mostly readable on the text side, while some images were missing, and two pages were also missing the visible text, but the OCR text layer was intact on these pages, so you could select & copy the invisible text layer. Pretty weird experience at the time. 😉

- 2
  
  hoakley on April 22, 2020 at 2:03 pm
  
  Thank you, Joss.
  Howard.
  
3

Duncan on April 22, 2020 at 11:51 am

Howard, this is fascinating research (and hopefully useful information for archival practitioners). I have always used ‘lowest common denominator’ formats with my own files, including plain, unformatted text almost exclusively for anything written, but with a different motivation, namely forward compatibility. Having used Macs since 1985 I’ve seen a lot of proprietary formats come and go (I still have a bunch of work from the early years trapped in SuperPaint files, but I also still have an old Mac IIci in storage if I ever get motivated enough to attempt to read them). I also still have almost all of my email correspondence dating back to the late 80s (hundreds of thousands of messages, all searchable thanks to Spotlight) and to this day still use plain text for all my outgoing messages, with the exception of attached graphics.

So it’s encouraging to now read that those same formats are also somewhat resilient, which makes sense since they eschew a lot of debatable complexity which would only hinder recovery efforts if required.

As a different area of research, I’m curious if you intend to also test the effects of encryption on resiliency. At first glance it should seem that encrypted files would be all-or-nothing in terms of recoverability, but now with the separate layer afforded by the T2 chip I’m not quite sure how that would work.

- 4
  
  hoakley on April 22, 2020 at 2:02 pm
  
  Thank you. Yes, there’s more to come still.
  Howard.
  
5

Curtis Wilcox on April 22, 2020 at 12:18 pm

> Even low levels of corruption can cause Microsoft’s apps (and Apple’s Pages and Numbers) to report unrecoverable errors

Since Microsoft’s “.*x” documents and Apple’s documents are all folders of content compressed into a single Zip file, I’m not surprised they’re not resilient against corruption. If they were decompressed for storage, they might fare better.
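Because these documents are Zip archives, Python's standard `zipfile` module can at least report which parts of a damaged document no longer read back cleanly. A minimal sketch, with `damaged_members` a name made up for this example:

```python
import zipfile
import zlib
from typing import List


def damaged_members(path: str) -> List[str]:
    """List archive members that fail to decompress or fail their CRC check.

    Raises zipfile.BadZipFile if the archive's directory itself cannot be read.
    """
    bad = []
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            try:
                with archive.open(info) as member:
                    # Reading to the end forces decompression and the CRC check.
                    while member.read(1 << 20):
                        pass
            except (zipfile.BadZipFile, zlib.error, EOFError):
                bad.append(info.filename)
    return bad
```

This doesn't repair anything, but knowing whether the damage is in the main XML or in one embedded image is a starting point for salvaging the rest.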

- 6
  
  hoakley on April 22, 2020 at 2:02 pm
  
  Thank you. Yes, that isn’t a design for resilience. However, compressing objects within a file, as in PDF, can prove more resilient.
  Or perhaps the apps should offer an uncompressed option.
  Howard.
  
  - 7
    
    Duncan on April 22, 2020 at 4:11 pm
    
    I’m certainly glad we’re past the era of using StuffIt to manage once-precious disk space. (I went through that phase for a few years before capacity bloomed and prices dropped.) I probably have some old .sit bundles buried in my file archives that might be lost to time.
    
    - 8
      
      hoakley on April 22, 2020 at 8:53 pm
      
      Thank you. More coming Friday morning, with some pleasant surprises.
      Howard.
      
9

Raoul on April 22, 2020 at 11:37 pm

Apple almost came to the party with APFS in that the APFS filesystem creates checksums when data is written. Disappointingly, Apple create checksums for metadata only. 8((
Perhaps a future version of APFS when used with RAID setups will have an option to create checksums for data blocks as well.

Creating checksums for data blocks as they’re written to disk provides a means to test each file’s integrity on the fly as the data is read back at a later date.
Should the original checksum of a data block not match the checksum calculated on the fly as the data is being read, there are many options for doing something about it.
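The principle can be sketched in a few lines of Python. This is only an illustration of per-block checksumming in general, not how any particular file system does it; the 4096-byte block size, the choice of SHA-256 and the function names are all assumptions made for the example.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # assumed block size for this illustration


def block_digests(data: bytes) -> List[str]:
    """SHA-256 digest of each fixed-size block, as would be stored at write time."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]


def corrupt_blocks(data: bytes, stored: List[str]) -> List[int]:
    """Indices of blocks whose digest on read-back differs from the stored one."""
    return [n for n, (now, then) in enumerate(zip(block_digests(data), stored))
            if now != then]
```

On its own this only detects which blocks are damaged. Doing something about it needs an intact second copy or some form of error-correcting data.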

When I put myself in Apple’s shoes I can understand why they stopped at checksumming only metadata rather than going all the way and checksumming data blocks as well.

Given that most Apple users probably don’t know what redundancy is, plus the fact that more laptops are sold compared to desktops, it would be too costly to have redundancy “out of the box” baked into every Mac sold.

Operating systems are starting to move towards CoW filesystems that checksum everything now. For example, Ubuntu 19.10 offers it as an option when installing and I understand it will be the default in version 20.04 by the looks.

If you search, you can have your data with checksums on macOS today, but it’s not something that is advisable for your typical standard Mac user unfortunately.

- 10
  
  hoakley on April 23, 2020 at 5:54 am
  
  Thank you.
  I don’t see any documentation of the use of checksums when writing data. Do you have a reference, please?
  Yes, there is limited use of checksums in metadata, which is mentioned in the existing APFS docs. They’re of no help at all in checking file integrity, though, and weren’t intended to.
  Yes, macOS can use other file systems such as ZFS to provide ECC, although I can’t find any proper evaluation of its effectiveness. Do you know of any?
  Although digests/checksums can reveal whether data is intact or corrupt, you need ECC to do anything useful in terms of recovering corrupt files.
  You write “if you search, you can have your data with checksums on macOS today”. I’d be delighted to know how you can do that on a Catalina boot volume, please, which is what most people need.
  Howard.
  
11

Raoul on April 23, 2020 at 9:20 am

Hi Howard,

I’m sorry, I didn’t want to hijack your post as I was referring to ZFS when I wrote “if you search…” so apologies for not being clearer in that APFS (at present) cannot help us.
You can however use ZFS to boot macOS (kernel on HFS/APFS though) and so you get all the checksum magic and other niceties that come with ZFS. Once again though, this is nowhere near something I’d recommend for anyone to try other than for pure hacker curiosity.

As for demonstrating ZFS’ effectiveness in correcting data corruption, it’s baked into the code. If you have a pool with 2 mirrored disks and the OS asks for some data (e.g. a Word file), ZFS goes off and locates the file pointer, reads the blocks to calculate a checksum on the fly and compares that to the original checksum. If they don’t match, ZFS will ask the 2nd disk to repeat the process and, if the checksums match, send that up the data path to the OS, then go back and “resilver” the blocks on disk 1 and increment the checksum error counter for that disk. If ZFS cannot resilver the blocks in question, it increments the read counter and marks the pool as degraded. This is when emails/SMS and all sorts of notifications can be sent to the data admins to warn them to replace the disk.

https://www.youtube.com/watch?v=CN6iDzesEs0 is a live stage presentation of the above.
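The read path described above can be modelled very roughly in Python. This is a toy with in-memory byte arrays standing in for mirrored disks, in no way ZFS's actual code; the function name and the use of SHA-256 are assumptions for the example.

```python
import hashlib
from typing import List


def read_with_repair(copies: List[bytearray], expected_sha256: str) -> bytes:
    """Return the first copy matching the stored digest, repairing bad copies.

    Toy model of a mirrored read; raises IOError if no copy is intact.
    """
    good = next((bytes(c) for c in copies
                 if hashlib.sha256(c).hexdigest() == expected_sha256), None)
    if good is None:
        raise IOError("all copies fail their checksum")
    for c in copies:
        if bytes(c) != good:
            c[:] = good  # overwrite the damaged copy from the intact one
    return good
```

The weak point of the scheme is visible in the `IOError` branch: recovery depends entirely on at least one copy being undamaged.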

I’ve been using ZFS since about 2005 and have had 4 disks fail on me in that time, but not catastrophically. Had I not been running ZFS, I probably would’ve written more data to them and then really got in a pickle, as my backups would have been copying corrupted data as well! ;))

- 12
  
  hoakley on April 23, 2020 at 11:04 am
  
  Thank you.
  You mean I could boot into Catalina on ZFS on the internal SSD in this iMac Pro? I don’t understand how that could be achieved without full APFS System and Data volumes, nor how you can have a ZFS volume on the boot disk.
  You describe what I suspected: ZFS doesn’t actually use ECC at all, merely automates what already exists for APFS (and what I’m already using for some files here). That’s helpful for servers and systems run by sysadmins, but for the vast majority of users seems of no real benefit.
  Relying on RAID mirrors isn’t without its own problems, and what you describe is far less efficient than a proper ECC system, which might only require 1.3 times storage capacity to support full recovery in many instances of file corruption, rather than 2.0 times capacity, which is what a RAID mirror requires. Furthermore, as I’ve shown in another article, an ECC Parchive (for instance) is reasonably resilient itself to corruption. If your mirror copy has 1 bit corrupted, the system you’ve described fails, which is therefore significantly less resilient than proper ECC.
  Howard.
  
  - 13
    
    Joss on April 23, 2020 at 1:40 pm
    
    It actually looks like ZFS works as the file system for a macOS boot volume, even on Catalina: https://openzfsonosx.org/wiki/ZFS_on_Boot
    
    As for ECC and par2: that’s why there are many vocal advocates of Unraid instead of standard RAID, because it has one (or two) discrete parity drives/volume(s). To me it seems more akin to par than to what standard RAID setups are doing. I haven’t really researched this variant, but rebuild times could in fact be faster, though write speeds on Unraid aren’t great in general (apparently). But it all depends on the kind of server and kind of storage. If you’re regularly backing up from one server to at least one other, JBOD would be A-OK. If there’s data failure, it’s always only on one disk, so if you have a backup, just replace the HDD & do a quick restore; and if you have no backup, living with the data loss need not be as catastrophic, depending on what you’re storing. (Obviously, you wouldn’t want the latter for private/important files.) Or you can do a mix: RAID for some files, and JBOD for others.
    
    - 14
      
      hoakley on April 23, 2020 at 4:18 pm
      
      Thanks, Joss.
      I had looked at that wiki page at OpenZFS, and don’t believe that has been tried with any more recent version of macOS than Sierra. It still refers to creating an HFS+ boot volume! If anyone here does have OpenZFS for their Catalina startup volumes, I’d be most interested to hear how they did it. AFAIK, it’s impossible, because of how much of macOS requires the System and Data volumes and the firmlinks between them.
      As I drill down more into ECC, the more I realise that there’s a huge gap between research and what we see on our Macs (and elsewhere). For example, most ECC refers to using Reed-Solomon codes, which have been demonstrated to perform relatively poorly on binary data, and are now only recommended for characters/symbols. Whether those ECC products are really using something more appropriate, I don’t know. Or maybe they’re just old and under-researched. So performance testing is very important.
      Howard.
      
15

Raoul on April 24, 2020 at 8:47 am

Have a chat with Lundman in the MacOS X ZFS forums and he’ll soon bring you up to speed about booting from ZFS.

Just so you know, ZFS by default writes a single copy of a file to a pool.
But you can tell ZFS to write multiple copies to a pool. This is very handy if you only have a single disk/media such as a laptop. You can tell ZFS to write the same block 3 times for example… you can change your mind whenever you want.
Then, in the event that a checksum doesn’t match when the file is read, ZFS can source the correct blocks from the other two locations on the media on the fly, just like the case if multiple disks exist in the pool. Admittedly, if this did happen, I’d be replacing the drive immediately and not think that multiple copies will keep me safe… But just wanted to point out that ZFS is more than just a filesystem, it has a volume manager baked right in as well and is a pleasure to use.

If only Sun hadn’t got acquired by Oracle… we would’ve had ZFS on macOS a decade ago (Apple officially released a kext to read ZFS filesystems out of the box!) and APFS wouldn’t exist. Thankfully Sun open sourced the code which gave it a chance to survive in the wild, and even better now that the Linux community have embraced ZFS.
Who knows, Apple may even provide some basic support to read ZFS again in the future.

Grab two USB sticks, Howard, and have a play. You’d admire how elegant it is to express one’s intent. You can also just use zvols, which are files to represent physical disks; this is what I do to store data on cloud services ;))

Regards,

- 16
  
  Joss on April 24, 2020 at 8:54 am
  
  Wasn’t ZFS supposed to be the new filesystem of choice for macOS at some point, the successor to HFS+? If I recall correctly, before Apple decided to go with their own system (APFS), mainly to cover mobile devices too, there were quite a few supporters of ZFS at Apple. Apparently, ZFS almost happened.
  
  - 17
    
    hoakley on April 24, 2020 at 8:58 am
    
    Yes. It’s still quite a bitter issue for some.
    In fairness, ZFS wasn’t and still isn’t ready for use by ordinary users, and that’s something that very few involved in the debate seem to recognise. For all its faults, APFS in the main does ‘just work’, whereas ZFS requires quite expert management. The thought of unleashing ZFS on a couple of billion iPhones, for example, still makes me giggle.
    Howard.
    
- 18
  
  hoakley on April 24, 2020 at 8:54 am
  
  Thank you.
  I’m sorry, I’m not in the mood for having “a play” with a file system. File integrity is about production tools and techniques. I’ll be very interested to look at this when it’s ready for general use. In the meantime, I conclude that ZFS doesn’t use ECC as such, but file replicates for recovery in the event of corruption.
  Howard.
  
19

Bob on May 21, 2020 at 12:56 am

For future reference, if you want to keep PDFs for the long haul, look at the PDF/A format. “A” is for archive. I think you can find details on the Wikipedia page for the PDF format. While this doesn’t address corruption, it does address future-proofing, a different kind of problem that can render a document unreadable. A PDF certified as PDF/A means that certain features are avoided, to ensure the document remains readable as standards change. There is even an app available to test a PDF and tell you if it’s equivalent to PDF/A. Sadly, MacOS doesn’t support that format. It mystifies me why a platform so entrenched in publishing doesn’t support more PDF formats. Luckily Libre Office does support the format. So if you need to store something on the digital equivalent of acid-free paper, check it out.

“Write programs to handle text streams, because that is a universal interface.”
Doug McIlroy, head of the Bell Labs Computing Sciences Research Center, and inventor of the Unix pipe.

- 20
  
  hoakley on May 21, 2020 at 5:52 am
  
  Thank you.
  I have a long series of articles looking at PDF in depth, listed here. Variants of PDF/A aren’t designed to be any more resilient to corruption, but are intended to be more self-contained and less reliant on external resources such as fonts.
  macOS does support the PDF/A variants: I’ve not come across any PDF/A which can’t be opened and read perfectly well on macOS, and it respects their read-only status too. However, the Quartz2D PDF rendering engine in macOS doesn’t specifically support the generation of PDF/A. In this case (unlike with other PDF variant standards) this makes sense, as PDF/A essentially consists of restrictions which are placed on what is written to the file. Those are in general the responsibility not of the rendering engine, but the app which is taking PDF objects and writing them to the file.
  As you point out, some macOS apps do support the generation of compliant PDF/A documents. And if you happen to be a Martian with a reliable free income, you can always lease Adobe Acrobat ‘Pro’ DC, as I do (although I deny being a Martian, at least).
  It’s also worth pointing out that PDF/A does make many documents larger, depending on their reliance on external resources. In the worst case, with lots of normally external fonts, they can be a lot bigger, and contain a great deal of data which is common to other PDFs, so being largely redundant.
  One change to the basic PDF standard which could readily make PDFs more resilient would be to require its objects to be stored uncompressed. That isn’t, as far as I know, available in any PDF variant, though.
  Howard.
  
21

Bob on May 21, 2020 at 12:24 pm

My apologies, I should have been more specific. I meant that Mac does not intrinsically support generation of PDF/A. Back in the day, it used to have an option to create PDF/X but they did away with it, I think after Lion or Mountain Lion. And thank you for the list of articles. I shall peruse them. I only recently discovered your site, and very much enjoy the combination of art and technology. A favored quote attributed to Steven Roberts: “Art without engineering is dreaming. Engineering without art is calculating.”

As for reliability of PDFs and compression, one could argue that since the “P” in PDF is for “Portable” and size is a factor in portability, minimal size may have been a design goal. I haven’t read “founding documents” on the matter so am not authoritative on the topic. Yes, compression makes data corruption much worse by having a multiplier effect upon decompression.

Measuring resilience to corruption of a PDF is, by my way of thinking, unnecessary because it boils down to a specific measure of probability. If we define a PDF as “broken” when it cannot render at all, then we can look at the component parts of a PDF and divide them into two classes: those that can be corrupted without breaking the PDF, and those that cannot be corrupted without breaking the PDF. I’m not an expert on PDF, but I think the first class would contain things like embedded images and fonts. Corruption in these components would, I think, be limited to their portion of the document. The second class would be the code portion; it would be unlikely that corrupted executable code could be properly parsed for rendering. Of course, what percentage of a PDF consists of code would be highly dependent on the PDF, but let’s assume a balance of 90% images and fonts, and 10% executable code. If we accept that corruption is a random independent event, then there’s a 10% chance that this PDF will be broken. And I suspect that the dreaded Corrupt Document message appears in these cases; one simply cannot get past a broken syntactic and semantic analysis.
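Extending that reasoning to several corrupt bytes, under the same assumptions (independent random hits, 10% of the file being critical), the chance that at least one of n hits lands in the critical part is 1 − 0.9^n. A short Python sketch follows; the 10% figure is the illustrative one from above, not a measurement, and `chance_broken` is a name invented for the example.

```python
def chance_broken(critical_fraction: float, corrupt_bytes: int) -> float:
    """Probability that at least one of `corrupt_bytes` independent random hits
    lands in the critical part of the file."""
    return 1.0 - (1.0 - critical_fraction) ** corrupt_bytes


for n in (1, 2, 4, 20, 40):
    print(n, round(chance_broken(0.10, n), 3))
```

On this simple model the chance of breakage rises quickly with the number of corrupt bytes, so even a modest critical fraction matters at higher levels of corruption.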

- 22
  
  hoakley on May 21, 2020 at 9:09 pm
  
  Thank you.
  The PDF format is potentially very robust, far more so than XML, for instance, in that it uses simple objects. With rare exceptions, any object can become broken, and all that happens is that part of the page can’t be displayed. Its syntax is very simple indeed.
  However, from decades of experience with damaged PDFs, even Adobe’s Acrobat ‘Pro’ is very poor at trying to recover from modest damage. More often than not, it simply abandons all efforts, and doesn’t even include repair tools. There’s a healthy PDF recovery and repair industry, as you might expect.
  Any file format which is intended to be used for archival purposes needs to have good resilience to corruption. If the format becomes unrecoverable when a few bytes get changed, then it’s quite unsuitable for files which might need to be opened in several decades time. We can read cuneiform today simply because the tablets on which it was written survived, and can be deciphered. There’s no point in using PDF/A if damaged files can’t be recovered in the future – we may as well stick with ink on paper, which with a little care will last more than a millennium.
  Howard.
  