Although I never intended my tests of the effects of file corruption to be models of any particular type or mechanism, many of you have wondered what differences there are between the single-byte damage they used and, say, a block of 512 bytes being affected by a disk problem. This article addresses that question in two key areas: image files, in which effects are easiest to see, and repairability using error-correction code (ECC).
To run these tests, I have updated my utility Vandal so that it works with 10 MB blocks of data. It can therefore control levels of damage down to 1 byte per 10 MB, and can corrupt blocks of bytes ranging in length from 1 to 1,000 (1 KB). For these tests, all the files used are less than 10 MB in size, so each had only one area of corruption in the whole file. This contrasts with my previous tests, in which single bytes were corrupted at multiple random locations evenly throughout the whole file.
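To make that kind of damage concrete, here is a minimal Python sketch of corrupting one contiguous block at a random offset. It is not Vandal's code: the function name `corrupt_block`, its parameters, and the choice to XOR each byte with a non-zero random value are all assumptions made for illustration.

```python
import random

def corrupt_block(data: bytes, block_len: int, rng: random.Random):
    """Return (damaged_copy, start_offset) with one contiguous run damaged.

    Hypothetical helper: this only imitates the kind of damage described
    in the text, and is not how Vandal itself is implemented.
    """
    if not 1 <= block_len <= len(data):
        raise ValueError("block_len must be between 1 and len(data)")
    # Pick a start so that the whole block fits inside the data.
    start = rng.randrange(len(data) - block_len + 1)
    damaged = bytearray(data)
    for i in range(start, start + block_len):
        # XOR with a non-zero value so every byte in the run really changes.
        damaged[i] ^= rng.randrange(1, 256)
    return bytes(damaged), start

# The block sizes used in the image tests: 1, 10, 100 and 512 bytes.
original = bytes(10_000)
for size in (1, 10, 100, 512):
    damaged, start = corrupt_block(original, size, random.Random(size))
    changed = sum(a != b for a, b in zip(original, damaged))
    print(f"{size:>3} byte block at offset {start}: {changed} bytes changed")
```

Seeding the generator makes each damaged copy reproducible, which helps when comparing the same damage across different file formats.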
I have so far looked at the following examples:
- an uncompressed TIFF containing a Fujifilm test image of 9.5 MB;
- a 16 Mpixel JPEG compressed at 1:6 in a file of 9.98 MB;
- a 12.2 Mpixel HEIC compressed at 1:10 in a file of 4.8 MB.
In each example, I used Vandal to produce random corruption of 1 byte, and of a single block 10, 100, and 512 bytes in length.
As expected, and suggested from my previous tests, visible damage to the TIFF was negligible until the size of the damaged block reached 100 bytes. It then presented as a single short line of damaged pixels which would be quite easy to correct.
The JPEG proved less resilient, and even with a single byte of corruption large areas of the lower part of the image were lost. However, this wasn’t consistent across different block sizes, and the 10 byte test had little effect on the image.
Effects on the HEIC image differed little according to the size of the corruption: I show below the results for 1, 100 and 512 byte blocks, with the damaged rectangles in the image outlined in red for ease of identification. Because of the compression method used in HEIC files, individual corrupt bytes each cause corruption in a rectangular section of the image; increasing the number of adjacent bytes which are corrupt has little effect on that damaged area, but more separate points of corruption result in more visibly damaged rectangles in the image. In that sense, HEIC is more resilient to larger amounts of corruption, so long as they are confined to one section of the file.
HEIC test file with one single corrupted byte in 4.8 MB of data. Damaged area outlined in red.
HEIC test file with 100 consecutive corrupted bytes in 4.8 MB of data. Damaged area outlined in red.
HEIC test file with 512 consecutive corrupted bytes in 4.8 MB of data. Damaged area outlined in red.
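The HEIC behaviour described above, where the number of separate points of corruption matters more than the length of each, can be illustrated with a toy model. It assumes, purely for illustration, that each rectangular area of the image is stored in its own fixed-length run of bytes. Real HEIC files don't have fixed-length sections, and `BYTES_PER_TILE` and `tiles_hit` are hypothetical names and values, not taken from the format.

```python
def tiles_hit(offsets, bytes_per_tile):
    """Set of tile indices containing at least one damaged byte offset."""
    return {offset // bytes_per_tile for offset in offsets}

BYTES_PER_TILE = 4_000  # hypothetical figure, chosen only for illustration

# One contiguous block of 512 damaged bytes.
contiguous = range(1_000_000, 1_000_512)
# 512 single damaged bytes spaced 9,000 bytes apart.
scattered = range(0, 512 * 9_000, 9_000)

print(len(tiles_hit(contiguous, BYTES_PER_TILE)))  # 1
print(len(tiles_hit(scattered, BYTES_PER_TILE)))   # 512
```

In this model the same 512 damaged bytes spoil one rectangle when adjacent, but 512 rectangles when scattered. A contiguous block can still straddle a boundary and damage two neighbouring rectangles.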
Recovery using ECC
Error-correcting code is complex. Although I would expect it to cope as well with large blocks of corrupt data as it did with multiple single bytes, I suspect that some of you may be disbelieving. To assess how well Parchive (Par2) ECC can recover files with larger blocks of corruption, I located a 9.99 MB PDF file to use for testing. I then corrupted it using Vandal to create 1,000 byte (1 KB) blocks of damage, and determined which damaged copies could be recovered completely using Par2 ECC at the standard 10% parity level. The total size of the original file and its .par2 parity data is 12.46 MB, i.e. 125% of the unprotected file.
When I corrupted only the original PDF file, and not the Par2 parity data, the original file could be recovered fairly consistently when up to 95 blocks of 1 KB had been corrupted, which is a total of 95 KB out of the 9.99 MB, or about 1% of the whole file. With that level of damage, the PDF is essentially unusable before ECC is performed.
I repeated that test using the same PDF, but this time applying the same level of corruption to both the PDF and its parity files. Recovery was then very probable up to a level of corruption of 35 blocks of 1 KB. As all the parity files are smaller than the original PDF, the proportion of damage which each sustained was higher than the 35 KB in 9.99 MB, or about 0.4%, in the PDF, for example being as high as 35 KB in 652 KB, or about 5%.
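The proportions quoted in this section can be checked with a few lines of Python. This assumes decimal units throughout (1 KB = 1,000 bytes, as used for the block sizes above, and 1 MB = 1,000,000 bytes); the helper `percent` is just a name for this sketch.

```python
KB = 1_000       # as stated for the block sizes above
MB = 1_000_000   # assumption: decimal megabytes

def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100 * part / whole

pdf_size = 9.99 * MB

# Damage to the PDF alone: 95 blocks of 1 KB.
print(round(percent(95 * KB, pdf_size), 2))      # 0.95
# Damage to the PDF when parity files are damaged too: 35 blocks of 1 KB.
print(round(percent(35 * KB, pdf_size), 2))      # 0.35
# The same 35 KB of damage in a 652 KB parity file.
print(round(percent(35 * KB, 652 * KB), 1))      # 5.4
# Total size of the PDF plus its parity data, relative to the PDF alone.
print(round(percent(12.46 * MB, pdf_size), 1))   # 124.7
```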
These tests confirm that even when file corruption occurs in large blocks of up to 1 KB, recovery using only modest quantities of parity data and the Par2 method is remarkably effective.
Patterns of file corruption and cause
Much as we’d all like to have detailed information about the real-world risks of different causes of file corruption, those data aren’t to be won easily. Studies of ‘bit rot’ require long-term observation of different types of storage, and careful assessment of any detected file corruption over several years at the very least.
Even with very different patterns of file corruption, the resilience of different file formats and the recovery abilities of ECC techniques don’t appear to change dramatically. Although it’s valuable to be able to vary the size of blocks of corruption, it doesn’t look likely to change many conclusions drawn from testing. Developing strategies to mitigate file corruption doesn’t appear to be heavily dependent on better understanding of cause or risk.