File Integrity 5: How well does error-correcting code work?

There are two effective strategies for preserving the integrity of important files: make multiple copies, and store them with error-correcting code (ECC) which will enable their recovery if and when they become damaged or corrupted. Last week I discovered that there is ECC available for all Macs in the Par2 or Parchive format, provided in Gerard Putter’s free utility MacPAR deLuxe. This article evaluates how effective that ECC is in practice.

To run these tests, I used a selection of regular documents on my iMac Pro running Catalina 10.15.4. Files were corrupted using my utility Vandal, which overwrites individual bytes at randomly-selected locations within the file with random bytes, to achieve an even corruption rate which can be set from 1 B/MB upwards to over 500,000 B/MB. The most frequently used test file was the SpringerOpen title Error-Correction Coding and Decoding, which I thought might be appropriate, and is 10.9 MB in size. Other test files included PDFs and one HEIC generated by the camera in an iPhone. As the file corruption technique, ECC and assessment are independent of file type, the conclusions should apply to any file, although some corrupted files are easier to use or repair when they’re still corrupted.
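
For readers who want to reproduce this kind of damage, here is a minimal Python sketch of the technique described above. It isn't Vandal's source code: the function name `corrupt`, its parameters, and the convention that 1 MB is 1,000,000 bytes are all assumptions made for illustration.

```python
import random

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def corrupt(data, rate_b_per_mb, seed=None):
    """Overwrite randomly chosen bytes of `data` with random values.

    `rate_b_per_mb` is the number of bytes to overwrite per MB of data.
    Hypothetical helper, not Vandal's actual implementation.
    """
    rng = random.Random(seed)
    # Total bytes to overwrite, capped at the length of the data.
    count = min(len(data), round(len(data) * rate_b_per_mb / BYTES_PER_MB))
    damaged = bytearray(data)
    # sample() picks distinct positions spread evenly over the whole file.
    for position in rng.sample(range(len(data)), count):
        damaged[position] = rng.randrange(256)
    return bytes(damaged)


if __name__ == "__main__":
    original = bytes(2 * BYTES_PER_MB)       # 2 MB of zero bytes
    damaged = corrupt(original, 45, seed=1)  # nominal rate of 45 B/MB
    changed = sum(a != b for a, b in zip(original, damaged))
    print(f"{changed} bytes differ from the original")
```

Because each replacement byte is itself random, roughly one in 256 will happen to match the byte it overwrites, so the number of bytes that really change can be slightly below the nominal rate.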

eccrecovery1

MacPAR deLuxe was used to produce a Parchive in Par2 format, a folder of .par2 files which are used together with the original file to try to reconstruct the original. When the original is undamaged, the .par2 files aren’t required to do this, merely to confirm that the file’s checksums are correct. When the original file has been corrupted by Vandal, it fails some of the checksum tests and MacPAR deLuxe attempts to repair it using ‘parity’ data stored in the .par2 files. When repair can be completed to restore the expected checksums, the app writes the original uncorrupted document; when attempts to repair are unsuccessful, the app reports that and all files are left unaltered.
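
The verify-then-repair cycle can be illustrated with a toy example in Python. This is deliberately not the Par2 method, which uses Reed-Solomon coding and can rebuild several damaged blocks: the sketch below swaps in a single XOR parity block, which can rebuild only one. The block size, function names and the layout of the recovery data are all assumptions chosen to keep it short.

```python
import hashlib

BLOCK_SIZE = 4  # toy value; real tools use far larger blocks


def split_blocks(data):
    """Split data into fixed-size blocks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % BLOCK_SIZE)
    return [padded[i:i + BLOCK_SIZE] for i in range(0, len(padded), BLOCK_SIZE)]


def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    result = bytearray(BLOCK_SIZE)
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


def make_recovery(data):
    """Record a checksum for each block plus one XOR parity block."""
    blocks = split_blocks(data)
    return {
        "length": len(data),
        "checksums": [hashlib.sha256(b).digest() for b in blocks],
        "parity": xor_blocks(blocks),
    }


def repair(damaged, recovery):
    """Return the repaired data, or None when repair isn't possible."""
    blocks = split_blocks(damaged)
    bad = [i for i, b in enumerate(blocks)
           if hashlib.sha256(b).digest() != recovery["checksums"][i]]
    if not bad:
        return damaged  # every checksum matches: nothing to repair
    if len(bad) > 1:
        return None     # one parity block can rebuild only one block
    intact = [b for i, b in enumerate(blocks) if i != bad[0]]
    # XOR of the parity block with all intact blocks yields the missing one.
    blocks[bad[0]] = xor_blocks(intact + [recovery["parity"]])
    return b"".join(blocks)[:recovery["length"]]
```

As with the app, an undamaged file only has its checksums confirmed, a file with repairable damage is rebuilt, and one with too much damage is reported as unrecoverable and left alone.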

eccrecovery2

There is one relevant control in the app: the “level of redundancy” used when creating .par2 files, by default set at 10%. That doesn’t indicate the percentage of corruption which can be corrected, but the amount of redundant repair information stored in the .par2 files, hence their total size.

eccrecovery2a

Vandal’s corruption

To illustrate the effect of corruption created by Vandal, here are three copies of the same HEIC image at different levels of corruption.

eccrecovery3

This is the original, uncorrupted.

eccrecovery4can45

This has been corrupted using Vandal to a level of 45 B/MB, from which MacPAR deLuxe was able to completely recover the original image.

eccrecovery5cant46

This has been corrupted at 46 B/MB, though, and MacPAR deLuxe was unable to recover the original image.

Total file size and recoverability

Increasing levels of redundancy result in greater total file size, with more ‘parity’ data being stored in more .par2 files. There’s an initial overhead of about 10% for using any Parchive Par2 ECC, on top of which total file size rises linearly. For the test 10.9 MB file, standard 10% redundancy resulted in a total file size of 13.5 MB, rising to 16.4 MB for 30% redundancy.
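
As a rough check on those figures, this Python sketch fits a straight line through the two total sizes quoted above. It assumes the relationship really is linear between and beyond those two points, which two measurements alone can't prove, and the function names are hypothetical.

```python
ORIGINAL_MB = 10.9  # size of the test PDF


def fit_line(point_a, point_b):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


# (redundancy in %, total size in MB) as quoted in the text
slope, intercept = fit_line((10, 13.5), (30, 16.4))


def total_size_mb(redundancy_pct):
    """Estimated total size of original plus .par2 files (assumed linear)."""
    return intercept + slope * redundancy_pct


# Overhead implied at 0% redundancy, as a percentage of the original size.
base_overhead_pct = (intercept - ORIGINAL_MB) / ORIGINAL_MB * 100

if __name__ == "__main__":
    print(f"growth: {slope:.3f} MB per 1% redundancy")
    print(f"implied base overhead: {base_overhead_pct:.1f}%")
    print(f"estimated total at 20%: {total_size_mb(20):.2f} MB")
```

The intercept works out at a little over 12 MB, which is consistent with the initial overhead of about 10% on the 10.9 MB original.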

eccrecovery7

Because Vandal inflicts random corruption throughout a file, recoverability using ECC varies slightly at any given level of corruption. At least three different tests were performed to estimate the maximum correctable damage for each level of redundancy. With no ECC (0%), any corruption makes the file unrecoverable to the original. At 10%, ECC was able to recover files with less than 15-18 B/MB corruption, rising to 60-65 B/MB at 30% ECC redundancy. Because of the complexity of the ECC method used in Par2, this isn’t quite a linear relationship.

eccrecovery8

These results can be brought together to show the relationship between the total size of files (as determined by the percentage ECC redundancy set in the app) and their maximum correctable damage (as B/MB). This confirms that using additional storage space to store more .par2 files results in significant additional resilience to file corruption – exactly as you would hope for. The ECC is doing its job effectively.

eccrecovery9

Original file size and recoverability

The results above all refer to one file, my 10.9 MB PDF. Using a different file suggested that there is a relationship between the size of the original file – that is, the file to be protected by ECC – and the ability of Par2 to recover it following corruption. I therefore estimated maximum correctable damage on four different files, three PDFs and one HEIC, with sizes ranging from 3.2 to 15.4 MB, all at the same 10% ECC redundancy. To my surprise, smaller files proved much more effectively protected by Par2, and larger files much less so. A 3.2 MB original could be corrected successfully with corruption of 50-60 B/MB, whereas a 15.4 MB file could only be recovered with 6 B/MB or less corruption.
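
Since these limits are rates, it can help to see them as absolute numbers of damaged bytes as well. The Python sketch below simply multiplies each file size by the rates quoted in this article at 10% redundancy; the function name is hypothetical, and the results are only as good as the estimated ranges they're based on.

```python
def damaged_bytes(size_mb, rate_b_per_mb):
    """Total bytes overwritten in a file of `size_mb` at the given rate."""
    return size_mb * rate_b_per_mb


# (file size in MB, low and high estimates of the maximum correctable
# damage in B/MB), all at 10% ECC redundancy, as quoted in the text.
estimates = [
    (3.2, 50, 60),
    (10.9, 15, 18),
    (15.4, 6, 6),  # "6 B/MB or less"
]

if __name__ == "__main__":
    for size_mb, low, high in estimates:
        low_bytes = damaged_bytes(size_mb, low)
        high_bytes = damaged_bytes(size_mb, high)
        print(f"{size_mb:5.1f} MB: {low_bytes:.0f} to {high_bytes:.0f} bytes in total")
```

This gives a second way to compare files of different sizes, alongside the rate in B/MB.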

eccrecoveryx

The relationship between original file size and maximum correctable damage is markedly non-linear too, as shown above. If this can be extrapolated, the maximum correctability for a very small file is just under 100 B/MB, whilst original files larger than about 20 MB can only be successfully recovered from corruption of around 1-2 B/MB.

This has serious implications for the use of Par2 with files much larger than 20 MB, and probably rules it out as a method of ECC for those larger than 1 GB. This may be the result of optimisation in the method to provide protection for files likely to be disseminated via Usenet Newsgroups. However, it emphasises the importance of using a range of test file sizes when assessing any form of ECC. Very large files are those most vulnerable to some forms of corruption.

Corruption of all files

The results above were obtained when only the original file was corrupted, leaving all the .par2 ‘parity’ data intact. In many cases of file corruption, that isn’t a realistic scenario: whatever corrupts the original file in a Parchive is likely to have corrupted several or all of the files in the set.

I looked at this in two examples, my 10.9 MB PDF and 4.1 MB HEIC image with 10% ECC redundancy. With only the original file corrupted, and the .par2 files left intact, the maximum correctable damage was 15-18 B/MB for the PDF, and 45-46 B/MB for the smaller HEIC. When all the files were corrupted, those fell to 15 B/MB and 25-30 B/MB respectively. This again appears to be a highly non-linear effect, but demonstrates that damage to the .par2 files isn’t catastrophic in its effect on recoverability.

How much is that damage?

Expressing levels of file corruption in B/MB can be deceptive, making them look impressively large. Expressed in terms of percentage, they appear much smaller: 10 B/MB is 0.001%, or 100 bytes in a 10 MB file. Although Vandal does spread those damaged bytes across the whole file, that’s not really a great deal of corruption.
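
The conversion is simple enough to sketch in Python. The function names are made up for illustration, and the sketch assumes 1 MB is 1,000,000 bytes, which is what the figures above imply.

```python
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def rate_to_percent(rate_b_per_mb):
    """Convert a corruption rate in B/MB to a percentage of all bytes."""
    return rate_b_per_mb / BYTES_PER_MB * 100


def bytes_damaged(size_mb, rate_b_per_mb):
    """Number of bytes overwritten in a file of the given size."""
    return size_mb * rate_b_per_mb


if __name__ == "__main__":
    for rate in (1, 10, 100):
        print(f"{rate} B/MB = {rate_to_percent(rate):.4f}% of the file")
    print(f"10 MB at 10 B/MB: {bytes_damaged(10, 10)} bytes damaged")
```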

Conclusions

The Parchive Par2 ECC method, as implemented in MacPAR deLuxe, is an effective way of guarding against file corruption. But:

it requires significant additional storage space for its ‘parity’ files, which increases with higher ECC redundancy levels, although those also increase maximum levels of correctable damage;
it provides greatest protection to smaller files, and doesn’t appear to provide much protection for files larger than about 20 MB;
it is probably unsuitable for protecting files larger than 1 GB;
its protection is significantly reduced when its ‘parity’ files are also corrupted;
it cannot provide any protection against high levels of corruption, above 100 B/MB or 0.01% of data, even for very small files.

ECC appears promising, but needs careful evaluation in real-world applications. Simply having ECC enabled doesn’t mean that damaged files can be recovered. There are also many different methods available for ECC, and for each different implementations, all of which are deeply technical. There are likely to be marked contrasts in their effectiveness and efficiency which may be unexplored.

5 Comments

Duncan on April 20, 2020 at 12:39 pm

I’m grateful that you’re pursuing this in an objective manner. As you said when first reviewing the pervasiveness of storage media errors, there just doesn’t seem to be enough data available to make a proper assessment. With this project, you’re now contributing that data, at least outside the circles of computer-science and information theory academia.

Let’s hope this subject catches more attention – it’s past time to put computers to work at guarding our data. This should be table stakes for any ongoing OS/file-system development.

  hoakley on April 20, 2020 at 4:31 pm
  
  Thank you.
  It worries me that so many are either saying that errors never occur (which is plain wrong, they may be less common, but still do occur), or that the solution is making lots of backups. Most of my more important documents here are backed up on multiple independent storage. What a waste of space, and of course some of those backups may just be copies of damaged files anyway.
  Howard.
  
    Duncan on April 20, 2020 at 5:02 pm
    
    “Most of my more important documents here are backed up on multiple independent storage. What a waste of space, and of course some of those backups may just be copies of damaged files anyway.”
    
    That’s my main complaint with having to manage all this manually. To achieve the best likelihood of not encountering data corruption one has to be both very paranoid and unrealistically fastidious in checking and re-checking everything, at every level. Who has time for that?
    
    Again, computers can easily accomplish all sorts of background/automated tasks, from encrypting your data (T2 chip) to indexing all your files (Spotlight) to backing up your storage (Time Machine). Maintaining file integrity should also be included in those built-in capabilities.
    
name99 on May 6, 2020 at 5:42 am

There are essentially two main FEC mechanisms, stream based and block based. Essentially, stream-based schemes correct errors that are bit-wise random, ie every bit has an equal probability of being flipped (think cell phone stream), and block-based schemes correct errors that occur as bursts (think bad sector on a CD/hard drive/SSD). Most sophisticated systems use both, one layered on the other; the block scheme is usually called Reed Solomon, the stream based is usually called a convolutional code (with ever higher levels of sophistication these days).

An app written by someone not a serious expert is probably block based, and not optimized for the sort of (stream-like, random bit) damage you are creating. Even so, it should be more robust than what you are seeing, closer to linear behavior. I’m guessing that certain critical data structures, like block headers, are not as robust as they should be, eg replicated across the file.

If you want to do better, first question is what’s the corruption model you are protecting against? Bad blocks, or bad bit flips? If you want to protect well against both, look at the layered schemes used by something like LTE or WiFi. Then remember that those schemes primarily DETECT so that you can retransmit. If you want more robustness, you need more redundancy: look at the levels of RS used for something like the DVD spec.

GAI on July 23, 2020 at 4:58 am

Full marks for considering the data integrity problem, one that many people rationalize away or disregard entirely.

Bonus points for doing empirical testing!

However, the author needs to keep in mind his tools and methodology when drawing conclusions from this testing. When you specify a rate of data corruption, you will of necessity have more difficulty recovering large files. Corrupting 1 GB at a rate of 50 B/MB creates ~51 KB of corruption whereas doing the same to a 100 KB file creates perhaps 5 bytes of corruption. Error correction code’s (“ECC”) ability to cope scales on a ratio; “amount of redundancy”:“amount of corruption”. Your tool creates corruption in direct proportion to file size, generating this observational artifact.

I suggest that if you create a uniform amount of corruption in your testing, such as 100 bytes in each file, you likely will find no difference in the ECC’s ability to cope between the large file and the small one.

That said, as I see it what is demanded is a two factor approach; a low level of redundancy ECC to solve small corruption issues, such as ‘silent’ bit rot [and more importantly, to automate the DETECTION of bit rot events], and full backups [ideally 3:2:1 backups that also contain the ECC information files] to provide a recovery option for grand failures, mass corruption, and total loss events.

No amount of ECC will fix a hard drive that burns in a structure fire, and no amount of simple backups will help you know if a bit has rotted in your archive. Hybridization is the key to assurance here.
