Of all the tools I have around the house, by far the most used is the tape measure. For without knowing how big that hole is, or what size of wood you need, or how big the space for the fridge is, no job can get started. Measurement is the key to so much.
As we’re all discovering about Covid-19, the most important countermeasures depend on basic tools. If you can’t diagnose infection accurately, then it’s almost impossible to investigate or do anything about it. Knowing whether someone has had a previous infection is another vital step before you can get an accurate idea of its prevalence in the population, and whether any immunisation might work.
So too with file integrity. Here two of the fundamental tools required are some means of assessing whether data in files has been changed or corrupted, and a way of deliberately and controllably corrupting them. Over the last few weeks I have provided Dintch and most recently its sibling Fintch as accessible apps for checking file integrity. Last week I put together another app, which can corrupt files controllably. For obvious reasons, I’m not adding this to the list of downloads here, although if you want to use a copy for research, I’ll be happy to provide it.
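The first of those two tools, assessing whether the data in a file has changed, is commonly built on a cryptographic digest: record the digest while the file is known to be good, then recompute and compare it later. The following is a minimal sketch of that general idea in Python, assuming SHA-256 as the digest. It isn't taken from Dintch or Fintch, and the function names `file_digest` and `is_unchanged` are purely illustrative.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file as a hex string.

    The file is read in chunks so that large files needn't fit in memory.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_unchanged(path, saved_digest):
    """Compare the file's current digest with one recorded earlier."""
    return file_digest(path) == saved_digest
```

A digest like this can only tell you that something has changed, not where or by how much, and it can't repair anything. That is where ECC comes in.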
Tomorrow I’ll be publishing here my first results from using Vandal, as this app is appropriately named, to determine how effective error-correcting codes (ECC) are. Although I have seen extensive predictions based on mathematical calculations, I have been attempting to discover how well ECC works in reality by corrupting a series of test documents and observing how reliably they can be recovered. Without that knowledge, you can’t tell whether ECC is worth the effort.
Following that, I’m using Vandal to add low levels of corruption to some standard document formats to assess how robust they are against such damage. I’ve been surprised that this crucial information doesn’t appear readily accessible, and isn’t included in otherwise detailed accounts of modern image and other file formats. Instead, we seem more concerned about qualities such as compression and its effects on image quality.
File corruption or damage remains one of the commonest problems users encounter, and is far more common than, for example, problems resulting from malware. Among the many questions people ask of me, it’s the one that just keeps coming. For those whose instant solution is backing up, although that definitely makes a big difference, it often isn’t a helpful answer. Sadly, many users do go to their backups, only to find that all copies there are also broken, which isn’t surprising as backup software can’t verify that it isn’t copying what’s already corrupt.
We also tend to store backups on the least reliable media, in particular traditional hard disks, with their long record of errors and failures. A typical basic Mac system now consists of a computer with its internal SSD and an external hard disk for Time Machine backups. Most RAID and NAS systems used by Macs still consist of hard disks rather than SSDs. Unless and until the price of SSDs falls considerably, that’s likely to remain the case for many years to come.
At present, Vandal doesn’t try to mimic any well-described patterns of file corruption, and with so many to choose from, I thought it best to ensure that the damage which it creates is consistent and controllable. It therefore replaces a random selection of bytes in a file with random bytes. These are spread evenly across the whole length of the file, rather than being concentrated in any small area, and the app’s code prevents two random bytes being written to the same location. You choose the rate of corruption, from a single byte in each MB of data, up to a maximum of more than half the total number of bytes in that file. When you set the rate at 10 B/MB, for instance, that guarantees that a 10 MB file will have exactly 100 ‘corrupt’ bytes written to it.
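To make that description concrete, here is a minimal sketch in Python of one way such a scheme could work. It is not Vandal’s actual code. I have assumed that a MB means 1,000,000 bytes, and that the even spread is achieved by dividing the data into equal slots and writing one random byte at a random position within each slot, which also ensures that no location is written twice. The name `corrupt` and its parameters are hypothetical.

```python
import random

MB = 1_000_000  # assumption: decimal megabyte; the article doesn't specify

def corrupt(data, rate_per_mb, seed=None):
    """Overwrite rate_per_mb bytes in each MB of data with random bytes.

    Returns the damaged copy and the list of positions written to.
    A written byte may by chance equal the original, so the number of
    bytes that actually differ can be lower than the number written.
    """
    rng = random.Random(seed)
    size = len(data)
    count = min(size * rate_per_mb // MB, size)
    out = bytearray(data)
    positions = []
    for i in range(count):
        # Slot i covers [start, end); slots don't overlap, so no position
        # is chosen twice and the writes are spread across the whole file.
        start = i * size // count
        end = (i + 1) * size // count
        pos = rng.randrange(start, end)
        out[pos] = rng.randrange(256)
        positions.append(pos)
    return bytes(out), positions

# A 10 MB buffer at 10 B/MB receives 10 x 10 = 100 written bytes.
damaged, written = corrupt(bytes(10 * MB), 10, seed=1)
print(len(written))
```

Passing a `seed` makes the damage repeatable, which helps when you want to compare how different formats cope with exactly the same pattern of corruption.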
I won’t anticipate tomorrow’s article about the ability of ECC to recover deliberately corrupted files, other than to say that it can, and what’s more it can do so even when its own additional ‘parity’ data are corrupted too. I don’t know how different methods of ECC compare, though. For example, one of the selling points of the ZFS file system as against APFS is that it incorporates ECC to repair damaged files, but so far I’ve been unable to discover any careful measurements of its efficacy compared with long-proven Parchives, the subject of tomorrow’s article. If you run ZFS on your Mac, I’d be particularly interested to hear from you, as I’m keen to get some measurements from it for comparison. I’d be just as glad of measurements from any other form of ECC.
Some of the most-used file formats for documents which we want to keep for the future are image formats, such as JPEG and now HEIC, and PDF. I’m not aware that any are intended to be specially robust in the face of damage, but I was pleasantly surprised to see that even quite heavily corrupted images could still be opened, and some of their contents remained usable. The effects differ between formats, though, and some are more vulnerable to diffuse changes which make the whole image unusable. That is also true of PDF, but not the case with some other text formats.
For too long we have assumed what we already know to be impossible, and pretended that just making ‘safe’ copies of important documents will preserve them for the future. It’s time for a bit of Vandalism.