Checking file integrity isn’t a full solution to preserving important documents, as it can’t by itself repair those which become damaged. Dintch can tell you whether your files still match the digest calculated when you tagged them, which gives you assurance that those whose digests haven’t changed remain intact. But what do you do when a digest doesn’t match? The naive answer is to look for a copy which does still match its digest. If you can’t find one, that file is irretrievably damaged.
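As a rough illustration of that kind of check, here is a minimal Python sketch. It assumes a SHA-256 digest, which may not be the algorithm Dintch uses, and the function names `file_digest` and `is_intact` are invented for this example.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    SHA-256 is an assumption made for this sketch only.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files needn't fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_intact(path, stored_digest):
    """True if the file's current digest matches the one saved when it was tagged."""
    return file_digest(path) == stored_digest
```

A mismatch tells you only that something has changed. It says nothing about where the damage is, nor how to undo it.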
Error-correcting codes (or error-correction codes, ECC) are the complete solution, and are already used in RAID level 6 implementations and elsewhere to ensure that even quite badly damaged files can be recovered. The maths behind them is amazingly complicated, involving things like Galois fields, and has a vocabulary of its own. At its heart, though, is the idea that you can supplement your important document with additional data, known as recovery blocks, which can be applied mathematically to repair missing or corrupt data in the file.
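The simplest way to see how extra data can rebuild what’s missing is single XOR parity. This is a deliberately simplified stand-in, not Reed-Solomon and not what Par2 does. It can rebuild only one lost block, and only when you know which block is missing. The data and the function name `xor_blocks` are made up for this sketch.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Four equal-sized data blocks standing in for a document.
data = [b"impo", b"rtan", b"t do", b"cume"]

# One recovery block: the XOR of all the data blocks.
recovery = xor_blocks(data)

# Suppose block 2 is lost. XORing the survivors with the
# recovery block gives back the missing block.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [recovery])
print(rebuilt)  # prints b't do'
```

Here one recovery block protects four data blocks, an overhead of 25%. More capable codes can generate many different recovery blocks from the same data, so that several damaged blocks can be repaired at once.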
Storing two copies of important documents adopts a similar if highly inefficient approach. If there’s corruption in one copy, assuming the second is still intact, you simply replace the whole file with the intact version. This can cope with 100% corruption of the file, but requires 100% additional storage capacity. It also depends on the second copy of the file remaining intact.
One of the most widely used forms of ECC is based on what are known as Reed-Solomon codes, which were introduced in 1960. These were originally aimed at correcting errors in serially-transmitted data, over radio signals and later in networking and similar communications. In the early days of the Internet, when Usenet newsgroups were a popular means of distributing files, Reed-Solomon codes were used for ECC of those files in a system known as Parchive. In 2002, a second version, Par2, was proposed, and this became widely adopted. As users moved away from newsgroups and Internet connections became far more reliable, Par2 has largely been forgotten. However, it’s still available, and is well-proven ECC which can be used today to protect important documents.
For the last 18 years or so, Gerard Putter has offered MacPAR deLuxe for macOS. The current version runs a treat on Catalina, is notarized and remains completely free.
Par2 was designed to supplement the original file with a series of .par2 files containing checksums for the file and all the ECC data needed to reconstruct the original from a damaged copy. That original is divided up into segments, and different .par2 files contain different combinations of recovery blocks. This provides redundancy across the complete set, so that (unlike in the duplicate file example) the .par2 files are resistant to damage or corruption themselves.
MacPAR deLuxe takes one or more files and generates a set of .par2 files to accompany them for ECC. When you wish to open the original documents, the first step is to reconstitute them, fixing any errors using those .par2 files. The app does this automatically.
In this case, all the .par2 files were intact, and recovery of the original 10.9 MB PDF was instantly successful.
When I deliberately corrupted both the original PDF and each of the .par2 files, the app had a harder job, as it couldn’t read all the recovery blocks. Because of its built-in redundancy, though, it was still able to recover the original document from the recovery blocks which remained.
Achieving this level of ECC, which is supposed to be able to recover files which have suffered 10% damage, requires 24% additional storage space, which seems remarkably efficient.
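To put those percentages into megabytes, this short sketch uses only the figures given here: the 10.9 MB PDF, 24% overhead for the .par2 set, and 100% for a second copy. The function name `extra_storage_mb` is invented for illustration.

```python
def extra_storage_mb(file_mb, overhead_fraction):
    """Additional storage needed, in MB, for a given overhead fraction."""
    return file_mb * overhead_fraction

file_mb = 10.9  # size of the test PDF

par2_extra = extra_storage_mb(file_mb, 0.24)  # .par2 recovery set
copy_extra = extra_storage_mb(file_mb, 1.00)  # a complete second copy

print(f"Par2 set:    {par2_extra:.1f} MB extra")  # prints 2.6 MB extra
print(f"Second copy: {copy_extra:.1f} MB extra")  # prints 10.9 MB extra
```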
From this brief testing, I think that Par2 has a lot of potential as a means of protecting important files using ECC, and it’s available now, at no cost. You don’t need a different file system such as ZFS, nor a RAID array for level 6. What I haven’t yet assessed is just how resilient it can be to corruption, and what the trade-off is with storage space.
To look at those, I’m writing an app which will deliberately inflict randomised corruption on files. I’ll be reporting back when I have further detailed information about the performance of Par2 ECC, as implemented in Gerard Putter’s impressive MacPAR deLuxe.
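Purely as an illustration of what randomised corruption might involve, and not a description of that app, here is a minimal Python sketch. The function name `corrupt_bytes`, its parameters, and the choice to damage individual bytes scattered through the file are all assumptions. Real damage may well come in contiguous runs instead.

```python
import random

def corrupt_bytes(data, fraction, seed=None):
    """Return a copy of data with int(len(data) * fraction) bytes altered.

    Positions are chosen at random without repeats. Passing a seed
    makes a test run repeatable.
    """
    rng = random.Random(seed)
    damaged = bytearray(data)
    count = int(len(damaged) * fraction)
    for pos in rng.sample(range(len(damaged)), count):
        # XOR with a non-zero value guarantees the byte really changes.
        damaged[pos] ^= rng.randint(1, 255)
    return bytes(damaged)
```

Applied with a fraction of 0.1 to the contents of a file, this alters one byte in ten, which is one way to put the claim of recovery from 10% damage to the test.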