There are two effective strategies for preserving the integrity of important files: make multiple copies, and store them with error-correcting code (ECC), which enables their recovery if they become damaged or corrupted. Last week I discovered that there is ECC available for all Macs in the Par2 or Parchive format, provided in Gerard Putter’s free utility MacPAR deLuxe. This article evaluates how effective that ECC is in practice.
To run these tests, I used a selection of regular documents on my iMac Pro running Catalina 10.15.4. Files were corrupted using my utility Vandal, which overwrites individual bytes at randomly-selected locations within the file with random bytes, to achieve an even corruption rate which can be set from 1 B/MB upwards to over 500,000 B/MB. The most frequently used test file was the SpringerOpen title Error-Correction Coding and Decoding, which I thought might be appropriate, and is 10.9 MB in size. Other test files included PDFs and one HEIC generated by the camera in an iPhone. As the file corruption technique, ECC and assessment are independent of file type, the conclusions should apply to any file, although some corrupted files are easier than others to use or repair while they’re still corrupted.
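The kind of corruption described above can be sketched in Python. This is a minimal illustration, not Vandal’s actual code: the function name `corrupt` and its parameters are hypothetical, and it assumes 1 MB is 1,000,000 bytes.

```python
import random
from typing import Optional


def corrupt(data: bytes, rate_b_per_mb: float, seed: Optional[int] = None) -> bytes:
    """Overwrite randomly chosen bytes with random values.

    rate_b_per_mb is the number of bytes to overwrite per 1,000,000 bytes
    of input (an assumption about what 'B/MB' means).
    """
    rng = random.Random(seed)
    buf = bytearray(data)
    # Number of bytes to overwrite, capped at the length of the data.
    n_hits = min(len(buf), round(len(buf) * rate_b_per_mb / 1_000_000))
    # Pick distinct positions so that no byte is overwritten twice.
    for pos in rng.sample(range(len(buf)), n_hits):
        buf[pos] = rng.randrange(256)
    return bytes(buf)


if __name__ == "__main__":
    original = bytes(2_000_000)            # 2 MB of zero bytes as a stand-in file
    damaged = corrupt(original, 50, seed=1)
    changed = sum(a != b for a, b in zip(original, damaged))
    print(f"{changed} bytes differ from the original")
```

Because each replacement byte is random, it can occasionally match the byte it overwrites, so the number of bytes that actually differ may be slightly lower than the number overwritten.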
MacPAR deLuxe was used to produce a Parchive in Par2 format, a folder of .par2 files which are used together with the original file to try to reconstruct the original. When the original is undamaged, the .par2 files aren’t required to do this, merely to confirm that the file’s checksums are correct. When the original file has been corrupted by Vandal, it fails some of the checksum tests and MacPAR deLuxe attempts to repair it using ‘parity’ data stored in the .par2 files. When repair can be completed to restore the expected checksums, the app writes the original uncorrupted document; when attempts to repair are unsuccessful, the app reports that and all files are left unaltered.
There is one relevant control in the app: the “level of redundancy” used when creating .par2 files, by default set at 10%. That doesn’t indicate the percentage of corruption which can be corrected, but the amount of redundant repair information stored in the .par2 files, hence their total size.
Vandal’s corruption
To illustrate the effect of corruption created by Vandal, here are three copies of the same HEIC image at different levels of corruption.
This is the original, uncorrupted.
This has been corrupted using Vandal to a level of 45 B/MB, from which MacPAR deLuxe was able to completely recover the original image.
This has been corrupted at 46 B/MB, from which MacPAR deLuxe was unable to recover the original image.
Total file size and recoverability
Increasing levels of redundancy result in greater total file size, with more ‘parity’ data being stored in more .par2 files. There’s an initial overhead of about 10% for using any Parchive Par2 ECC, on top of which total file size rises linearly. For the test 10.9 MB file, standard 10% redundancy resulted in a total file size of 13.5 MB, rising to 16.4 MB for 30% redundancy.
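The two sizes quoted above are enough to check the arithmetic. The sketch below assumes the linear relationship holds and fits a straight line through just those two points, so the slope and the extrapolated 0% figure are rough estimates rather than measurements; `fit_line` is a hypothetical helper.

```python
def fit_line(p1, p2):
    """Return (slope, intercept) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


ORIGINAL_MB = 10.9  # size of the test PDF

# (redundancy %, total size in MB) as reported for the test file
slope, intercept = fit_line((10, 13.5), (30, 16.4))

# The intercept is the extrapolated total size at 0% redundancy; its excess
# over the original file size estimates the fixed overhead.
overhead_pct = (intercept / ORIGINAL_MB - 1) * 100

print(f"growth: {slope:.3f} MB per 1% redundancy")
print(f"extrapolated size at 0%: {intercept:.2f} MB")
print(f"fixed overhead: about {overhead_pct:.1f}% of the original")
```

With these figures the extrapolated size at 0% redundancy is about 12.05 MB, or between 10% and 11% more than the original.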
Because Vandal inflicts random corruption throughout a file, recoverability using ECC varies slightly at any given level of corruption. At least three different tests were performed to estimate the maximum correctable damage for each level of redundancy. With no ECC (0%), any corruption makes the original unrecoverable. At 10%, ECC was able to recover files with less than 15-18 B/MB corruption, rising to 60-65 B/MB at 30% ECC redundancy. Because of the complexity of the ECC method used in Par2, this isn’t quite a linear relationship.
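One way to see why randomly scattered damage gives slightly variable results is a toy block model. Par2 works on fixed-size blocks, and each recovery block can stand in for one damaged source block. The sketch below assumes repair succeeds whenever the number of damaged blocks doesn’t exceed the number of recovery blocks. The block count of 1,000 is purely hypothetical, not a setting taken from MacPAR deLuxe, and the model ignores damage to the .par2 files themselves.

```python
import random


def damaged_blocks(file_size: int, n_blocks: int, n_bad_bytes: int,
                   rng: random.Random) -> int:
    """Count the distinct blocks hit by n_bad_bytes randomly placed bad bytes."""
    block_size = -(-file_size // n_blocks)  # ceiling division
    positions = rng.sample(range(file_size), n_bad_bytes)
    return len({pos // block_size for pos in positions})


def repair_success_rate(file_size: int, n_blocks: int, redundancy_pct: int,
                        rate_b_per_mb: float, trials: int = 200,
                        seed: int = 0) -> float:
    """Fraction of trials in which damaged blocks <= recovery blocks (toy model)."""
    rng = random.Random(seed)
    recovery_blocks = n_blocks * redundancy_pct // 100
    n_bad = round(file_size * rate_b_per_mb / 1_000_000)
    ok = sum(
        damaged_blocks(file_size, n_blocks, n_bad, rng) <= recovery_blocks
        for _ in range(trials)
    )
    return ok / trials


if __name__ == "__main__":
    # Hypothetical 1 MB file split into 1,000 blocks, with 10% redundancy.
    for rate in (50, 100, 105, 110, 500):
        share = repair_success_rate(1_000_000, 1000, 10, rate)
        print(f"{rate:>3} B/MB: repaired in {share:.0%} of trials")
```

In this model the cut-off isn’t perfectly sharp, because two bad bytes sometimes land in the same block. If the number of blocks were fixed regardless of file size, the tolerable rate in B/MB would also fall as files grow, but whether MacPAR deLuxe behaves that way isn’t established here.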
These results can be brought together to show the relationship between the total size of files (as determined by the percentage ECC redundancy set in the app) and their maximum correctable damage (as B/MB). This confirms that using additional storage space to store more .par2 files results in significant additional resilience to file corruption – exactly as you would hope for. The ECC is doing its job effectively.
Original file size and recoverability
The results above all refer to one file, my 10.9 MB PDF. Using a different file suggested that there is a relationship between the size of the original file – that is, the file to be protected by ECC – and the ability of Par2 to recover it following corruption. I therefore estimated maximum correctable damage on four different files, three PDF and one HEIC, with sizes ranging from 3.2 to 15.4 MB, all at the same 10% ECC redundancy. To my surprise, smaller files proved much more effectively protected by Par2, and larger files much less so. A 3.2 MB original could be corrected successfully with corruption of 50-60 B/MB, whereas a 15.4 MB file could only be recovered with 6 B/MB or less corruption.
The relationship between original file size and maximum correctable damage is markedly non-linear too, as shown above. If this can be extrapolated, the maximum correctability for a very small file is just under 100 B/MB, whilst original files larger than about 20 MB can only be successfully recovered from corruption of around 1-2 B/MB.
This has serious implications for the use of Par2 with files much larger than 20 MB, and probably rules it out as a method of ECC for those larger than 1 GB. This may be the result of optimisation in the method to provide protection for files likely to be disseminated via Usenet Newsgroups. However, it emphasises the importance of using a range of test file sizes when assessing any form of ECC. Very large files are those most vulnerable to some forms of corruption.
Corruption of all files
The results above were obtained when only the original file was corrupted, leaving all the .par2 ‘parity’ data intact. In many cases of file corruption, that isn’t a realistic scenario: whatever corrupts the original file in a Parchive is likely to have corrupted several or all of the files in the set.
I looked at this in two examples, my 10.9 MB PDF and 4.1 MB HEIC image with 10% ECC redundancy. With only the original file corrupted, and the .par2 files left intact, the maximum correctable damage was 15-18 B/MB for the PDF, and 45-46 B/MB for the smaller HEIC. When all the files were corrupted, those fell to 15 B/MB and 25-30 B/MB respectively. This again appears to be a highly non-linear effect, but demonstrates that damage to the .par2 files isn’t catastrophic in its effect on recoverability.
How much is that damage?
Expressing levels of file corruption in B/MB can be deceptive, making them look impressively large. Expressed in terms of percentage, they appear much smaller: 10 B/MB is 0.001%, or 100 bytes in a 10 MB file. Although Vandal does spread those damaged bytes across the whole file, that’s not really a great deal of corruption.
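The conversion is simple enough to check in a few lines of Python. The helper name `describe_rate` is hypothetical, and 1 MB is taken as 1,000,000 bytes, as in the figures above.

```python
def describe_rate(rate_b_per_mb: float, file_size_mb: float):
    """Convert a corruption rate in B/MB to (percent of data, damaged bytes)."""
    # 1 B/MB is 1 byte in 1,000,000, or 0.0001%.
    percent = rate_b_per_mb / 10_000
    damaged_bytes = rate_b_per_mb * file_size_mb
    return percent, damaged_bytes


if __name__ == "__main__":
    percent, damaged = describe_rate(10, 10)
    print(f"10 B/MB in a 10 MB file: {percent}% of data, {damaged} bytes")
```

For 10 B/MB in a 10 MB file, this gives 0.001% and 100 bytes, matching the figures above.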
Conclusions
The Parchive Par2 ECC method, as implemented in MacPAR deLuxe, is an effective way of guarding against file corruption. But:
- it requires significant additional storage space for its ‘parity’ files, which increases with higher ECC redundancy levels, although those also increase maximum levels of correctable damage;
- it provides greatest protection to smaller files, and doesn’t appear to provide much protection for files larger than about 20 MB;
- it is unsuitable for protecting files larger than 1 GB;
- its protection is significantly reduced when its ‘parity’ files are also corrupted;
- it cannot provide any protection against high levels of corruption, above 100 B/MB or 0.01% of data, even for very small files.
ECC appears promising, but needs careful evaluation in real-world applications. Simply having ECC enabled doesn’t mean that damaged files can be recovered. There are also many different methods available for ECC, and for each there are different implementations, all of which are deeply technical. There are likely to be marked contrasts in their effectiveness and efficiency, which may be unexplored.