In my quest for methods to preserve the integrity of files, I’ve so far considered almost exclusively those of modest size, such as still images and regular office-style documents. Today I turn to look at the problems posed by much larger files, with sizes greater than 1 GB, which for many will include video and movies, virtual machines, and perhaps databases.
Keeping multiple copies of very large files is expensive, and consumes storage at an alarming rate. They’re also difficult and expensive to commit to removable archive media such as optical disks: conventional Blu-ray disks offer 25 or 50 GB, with 100 GB the current practical maximum. Magnetic media such as tape and hard disk can readily extend to terabytes of storage, but at significant risk of data corruption over time.
If there’s one area for which error-correcting codes (ECC) would appear to be in greatest demand, it should be the storage of large files. Unfortunately, their promise has yet to be realised, at least on macOS.
The warning sign came in one of the graphs in my previous article looking at ECC on smaller files, which showed the relationship between original file size and the maximum damage that could be corrected.
To look at larger files, I built another new version of my corruption app Vandal, which can now inflict low levels of damage on files of more than 1 GB, and used that to damage an 18.2 GB disk image as my test file. Once again, I used the Par2 standard in Parchives produced by MacPAR deLuxe, which remains the only ECC utility I have been able to locate for current macOS.
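To give an idea of the kind of damage involved, here’s a minimal Swift sketch that overwrites single bytes at random offsets in a file, in place. It’s illustrative only, a simplified stand-in rather than Vandal’s actual code, and the path in the usage line is just a placeholder.

```swift
import Foundation

// Simplified sketch: damage `count` single bytes at random offsets in the
// file at `url`, in place. Illustrative only, not Vandal's actual code.
func corrupt(fileAt url: URL, bytes count: Int) throws {
    let handle = try FileHandle(forUpdating: url)
    defer { try? handle.close() }
    let size = try handle.seekToEnd()                 // file length in bytes
    guard size > 0 else { return }
    for _ in 0..<count {
        let offset = UInt64.random(in: 0..<size)      // pick a random position
        try handle.seek(toOffset: offset)
        let original = try handle.read(upToCount: 1)?.first ?? 0
        // XOR with a non-zero value guarantees the byte actually changes
        let damaged = Data([original ^ UInt8.random(in: 1...255)])
        try handle.seek(toOffset: offset)
        try handle.write(contentsOf: damaged)
    }
}

// Example (hypothetical path): scatter 200 damaged bytes through a disk image
// try corrupt(fileAt: URL(fileURLWithPath: "/Volumes/Test/test.dmg"), bytes: 200)
```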
The first problem to become apparent is the time and CPU effort required to generate Parchives and repair damaged files: coding to Parchive format proceeds at a little less than 50 MB/sec, so my test file typically took more than 6 minutes to turn into an ECC set. That’s using the 3.2 GHz 8-Core Xeon W in my iMac Pro, with its fast internal SSD. Decoding and file recovery took slightly longer, over 7 minutes, at a rate of just under 40 MB/sec. A lot of that time was spent with Activity Monitor’s % CPU indicating more than 1000%, the highest sustained workload I’ve seen since I bought this Mac.
The Parchive format is very compact, though. My 18.2 GB test file required only 20.0 GB as a protected Parchive at the standard 10% level of redundancy, that is 110% of the original size. Compared with the 200% consumed by a simple RAID level 1 mirror, that’s a major advantage.
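For anyone wanting to estimate the cost for their own files, the arithmetic is simple. The throughput and redundancy figures used below are just those measured here, and will differ with other hardware and settings.

```swift
import Foundation

// Back-of-envelope estimates, using the figures measured above.
let fileSizeGB  = 18.2
let encodeMBps  = 50.0      // observed Parchive encoding throughput
let repairMBps  = 40.0      // observed decoding and repair throughput
let redundancy  = 0.10      // standard 10% recovery data

let sizeMB = fileSizeGB * 1000.0
print("Encode time   ≈ \(sizeMB / encodeMBps / 60.0) minutes")    // ≈ 6.1
print("Repair time   ≈ \(sizeMB / repairMBps / 60.0) minutes")    // ≈ 7.6
print("Parchive set  ≈ \(fileSizeGB * (1.0 + redundancy)) GB")    // ≈ 20.0
print("RAID 1 mirror ≈ \(fileSizeGB * 2.0) GB")                   // 36.4
```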
Randomised, evenly-spread single-byte corruption within the test file is what caused the problems with recovery. Whereas the 4 MB test files used previously could still be recovered successfully after about 14 bytes of corruption per MB, in the large test file that threshold fell to about 0.01 bytes per MB. Once corruption rose above 1 byte per 100 MB, around 200 bytes across the whole 18.2 GB test file, recovery proved consistently unsuccessful.
This doesn’t, of course, mean that an 18.2 GB file with more than 200 bytes of corruption in total can never be recovered. Longer runs of corrupted bytes at fewer separate locations through the file are more likely to be recoverable, but this still isn’t impressively resilient. It’s the result of the design of the Parchive and Par2 standards: repair works on whole slices of data, so any slice containing even a single damaged byte consumes one recovery slice, and the standards were intended to repair much smaller files using a minimum of parity data.
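The arithmetic behind that ceiling is worth spelling out. The sketch below assumes the Parchive was built with around 2,000 source slices, a common Par2 default, although I haven’t confirmed the setting MacPAR deLuxe used here; with those numbers, single bytes scattered one per slice are the worst case, and recovery slices run out at roughly 200 errors.

```swift
import Foundation

// Why ~200 scattered single-byte errors can defeat 10% redundancy.
// Assumption: about 2,000 source slices, a common Par2 default;
// MacPAR deLuxe's actual setting may differ.
let fileSizeMB   = 18_200.0
let sourceSlices = 2_000.0
let redundancy   = 0.10

let sliceSizeMB    = fileSizeMB / sourceSlices       // ≈ 9.1 MB per slice
let recoverySlices = sourceSlices * redundancy       // 200 recovery slices

// Each damaged source slice needs one whole recovery slice, however small
// the damage, so scattered single-byte errors waste recovery data fastest.
print("Slice size ≈ \(sliceSizeMB) MB")
print("Correctable scattered single-byte errors ≈ \(Int(recoverySlices))")
print("That is ≈ \(recoverySlices / fileSizeMB) bytes per MB")   // ≈ 0.011
```

That figure of about 0.01 bytes per MB matches what was measured above, which suggests it’s the slice count, rather than the total amount of parity data, that limits resilience here.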
For an ECC method to be suitable for use with files larger than tens of megabytes, and ideally with those larger than a gigabyte, a different approach is required:
- The ECC used needs to be optimised for speed, using a modern technique such as Turbo, LDPC or Polar codes, as used in 5G, for example.
- Both encoding and decoding need to be accelerated, by making better use of multiple cores and the GPU.
- Recovery slices need to be adjusted to adapt better to very large files, as sketched below.
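As one illustration of the last point, and no more than a sketch: instead of a fixed slice count, an encoder could scale the number of slices with file size, keeping each slice small enough that scattered damage remains repairable. The 1 MB target size here is an arbitrary choice for the example, and the cap reflects Par2’s block limit of 32,768 as I understand it.

```swift
import Foundation

// Sketch only: choose a slice count that scales with file size, aiming for
// roughly 1 MB slices rather than a fixed count of around 2,000.
// The 1 MB target is arbitrary; the cap is Par2's block limit as I understand it.
func suggestedSliceCount(fileSizeBytes: UInt64,
                         targetSliceSize: UInt64 = 1_048_576,   // ~1 MB slices
                         maxSlices: UInt64 = 32_768) -> UInt64 {
    let wanted = (fileSizeBytes + targetSliceSize - 1) / targetSliceSize
    return min(max(wanted, 1), maxSlices)
}

let size: UInt64 = 18_200_000_000            // the 18.2 GB test file
let slices = suggestedSliceCount(fileSizeBytes: size)
let recovery = Double(slices) * 0.10         // at 10% redundancy
print("\(slices) slices, ≈ \(Int(recovery)) damaged slices repairable")
// ≈ 17,357 slices, so ≈ 1,735 scattered single-byte errors become repairable
```

The trade-off is that, with Par2’s Reed-Solomon coding, more slices at the same level of redundancy means proportionately more computation, which is why the first two points, faster codes and better use of multiple cores and the GPU, would be needed to make this bearable on files of this size.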
Until these have been accomplished, the use of Par2 Parchive ECC with files larger than about 100 MB is unlikely to prove worth the effort. Despite the promise of ECC, the only effective method of preserving the integrity of such large files remains duplication, either in mirrors or simple copies.