Efficient resilient storage

Most of what we do in computing reduces to processing, storage and output. Storage is central to everything we do, and its goals are to be reliable, efficient and resilient. Unfortunately, as this article shows, those three are in conflict; in particular, efficiency opposes reliability and resilience.

Theory

Raw file data aren’t, in general, efficient in terms of the space they occupy. Good examples of this are modern formats based on XML, which can readily contain gigabytes of text. That’s easily compressed using lossless algorithms, and even rapid methods can reduce the space required considerably. Among the most efficient practical methods of compressing long text files are those based on Lempel-Ziv codes, as they encode long runs of source symbols into far fewer bits in the compressed version. However, they seldom approach the practical limit determined by the Shannon information content of the source.
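To make that concrete, here’s a minimal sketch in Swift, assuming macOS 10.15 or later for Foundation’s compression support, showing how readily repetitive XML-like text shrinks under zlib, an LZ77-based (Lempel-Ziv) method; the sample text and counts are purely illustrative.

```swift
import Foundation

// Build a highly repetitive XML-like string, typical of raw property lists.
let element = "<key>SampleValue</key><string>The quick brown fox</string>\n"
let raw = Data(String(repeating: element, count: 10_000).utf8)

// Compress it with zlib, an LZ77 (Lempel-Ziv) based method.
// NSData.compressed(using:) requires macOS 10.15 or later.
do {
    let compressed = try (raw as NSData).compressed(using: .zlib) as Data
    let ratio = 100.0 * Double(compressed.count) / Double(raw.count)
    print("Raw size: \(raw.count) bytes")
    print("Compressed size: \(compressed.count) bytes")
    print(String(format: "Compressed to %.1f%% of the original", ratio))
} catch {
    print("Compression failed: \(error)")
}
```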

Raw file data are usually more amenable to correction and repair in the event of damage or corruption, though. Because their information content is less dense, they can more readily be parsed by humans or machines, which supports correction and recovery.

As data is compressed towards the practical limit set by its Shannon information content, it also becomes more susceptible to the effects of damage or corruption. Efficient stream codes such as Lempel-Ziv normally fail to decode correctly if any of the bits in the compressed file are altered. When a chunk as large as 4 KB, the size of a modern hard disk sector, is lost, such damage normally makes it impossible to decompress the file at all.
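As a rough demonstration of that fragility, the sketch below (again assuming macOS 10.15 or later) flips a single bit in zlib-compressed data; depending on where the damage falls, decompression either fails outright or returns corrupted content.

```swift
import Foundation

// Compress some repetitive text, then corrupt a single bit of the result.
let original = Data(String(repeating: "<value>42</value>\n", count: 5_000).utf8)
let compressed = try! (original as NSData).compressed(using: .zlib) as Data

var damaged = compressed
damaged[2] ^= 0b0000_0100        // flip one bit near the start of the stream

// Attempt to decompress the damaged data and compare with the original.
do {
    let recovered = try (damaged as NSData).decompressed(using: .zlib) as Data
    print(recovered == original ? "Recovered intact (unlikely)"
                                : "Decoded, but the content is corrupted")
} catch {
    print("Decompression failed outright: \(error)")
}
```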

Checking the integrity of whole files is quick and simple: compute a checksum or hash for the file. That’s only the first step in resilience, though. If a file doesn’t match its checksum or hash, the more difficult tasks are identifying where the errors are and repairing them. Although there are perfect solutions, they’re so inefficient that practical methods generally accept a low probability of failure rather than aiming for perfection.
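As a minimal sketch of that first step, assuming CryptoKit on macOS 10.15 or later and using a placeholder file path, this computes a whole-file SHA-256 digest that can be stored and compared later; a mismatch tells you the file has changed, but not where or how.

```swift
import Foundation
import CryptoKit

// Compute the SHA-256 digest of a file; the path is purely illustrative.
let url = URL(fileURLWithPath: "/path/to/important/file")

do {
    let data = try Data(contentsOf: url)
    let digest = SHA256.hash(data: data)
    let hex = digest.map { String(format: "%02x", $0) }.joined()
    print("SHA-256: \(hex)")
    // Comparing hex with a previously stored value gives high confidence
    // that the file is intact; a mismatch only shows that something,
    // somewhere in the file, has changed.
} catch {
    print("Could not read file: \(error)")
}
```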

Unfortunately, error detection and correction require that data is added to the file. For example, Hamming codes add parity-checking bits, and more powerful block codes such as Reed-Solomon add parity symbols; the more redundancy that’s added, the more errors can be detected and corrected. Special consideration also needs to be given to the complete loss of chunks of data, as might occur with the loss of 4 KB hard disk sectors. Even methods which considerably increase the amount of data stored face practical limits in the amount of loss they can recover from.
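To show the principle at its simplest, here’s a toy sketch of the classic Hamming(7,4) code, in which four data bits gain three parity bits, enough to correct any single flipped bit in the seven-bit codeword; practical codes such as Reed-Solomon work on much larger symbols, but rely on the same addition of redundancy.

```swift
// A toy Hamming(7,4) encoder/decoder: 4 data bits gain 3 parity bits,
// allowing any single flipped bit in the 7-bit codeword to be corrected.
// Bits are held in an array indexed 1...7 (index 0 unused) to keep the
// classic parity positions 1, 2 and 4 obvious.

func hammingEncode(_ d: [Int]) -> [Int] {
    // d holds the four data bits d1...d4 (each 0 or 1).
    var c = [Int](repeating: 0, count: 8)
    c[3] = d[0]; c[5] = d[1]; c[6] = d[2]; c[7] = d[3]
    c[1] = c[3] ^ c[5] ^ c[7]       // parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]       // parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]       // parity over positions 4, 5, 6, 7
    return c
}

func hammingDecode(_ codeword: [Int]) -> (data: [Int], correctedPosition: Int) {
    var c = codeword
    // The syndrome gives the position of a single-bit error, or 0 if none.
    let s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    let s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    let s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    let syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome != 0 { c[syndrome] ^= 1 }   // flip the faulty bit back
    return ([c[3], c[5], c[6], c[7]], syndrome)
}

// Encode 4 data bits, flip one bit in transit, then decode and correct.
var codeword = hammingEncode([1, 0, 1, 1])
codeword[6] ^= 1                            // simulate a single-bit error
let (data, fixedAt) = hammingDecode(codeword)
print("Recovered data \(data), corrected bit at position \(fixedAt)")
```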

Quandary

The more efficiently you store data, using compression, the less recoverable it is from error and damage. The more resilient you make the data stored, the less efficient it becomes.

Examples

The PDF file format is very old, dating back over thirty years; by modern standards, it’s antique. A PDF consists of a text file containing dictionaries of objects, many of which are compressed, although in its early days most objects, apart from embedded images, were stored as plain text. Applying a further compression step to a PDF seldom reduces its size much, because all the larger objects have already been compressed. Damage to the contents of individual objects is largely survivable, although the affected objects will be lost, but damage to the file’s structure is likely to render the whole file unreadable. Limited error detection is built into the format, but there’s no inherent support for correction or recovery.

Making PDF files more resilient can only be achieved using external methods, which increase their effective size.

Many modern applications now store raw data in open or proprietary XML formats, which are recognised as being highly inefficient. To reduce their storage requirements, these are now commonly compressed using Zip and similar techniques. While the uncompressed originals may be repairable in the event of damage, errors in the compressed data are likely to result in the loss of the whole file.

Applying further compression to those files seldom results in much saving, and at worst can increase their size. That’s because the compressed data has little redundancy left to exploit, and no lossless method can shrink every possible input, so in the worst case the result is slightly larger than the original. Compressed XML formats rarely go beyond the basic checksums of their Zip containers to check integrity, and none use error-correcting codes. They too can therefore only be made more resilient by increasing their effective size.
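A quick sketch of that diminishing return, again using Foundation’s zlib support under the same assumptions as earlier: compressing already-compressed data typically saves little, and can even add a few bytes.

```swift
import Foundation

// Compress repetitive XML-like text once, then compress the result again;
// the second pass typically saves little or even adds bytes.
let raw = Data(String(repeating: "<item id=\"1\">value</item>\n", count: 20_000).utf8)

do {
    let once = try (raw as NSData).compressed(using: .zlib) as Data
    let twice = try (once as NSData).compressed(using: .zlib) as Data
    print("Raw: \(raw.count) bytes")
    print("Compressed once: \(once.count) bytes")
    print("Compressed again: \(twice.count) bytes")
} catch {
    print("Compression failed: \(error)")
}
```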

APFS sparse files adopt a completely different approach: in files where much of the content is void, only the non-void data is stored. This is extremely efficient, but offers no resilience.
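Here’s a hedged sketch of how a sparse file arises, assuming macOS 10.15.4 or later for the FileHandle calls and using a throwaway path in /tmp: the code writes a few bytes, seeks 100 MB further on, and writes a few more; on APFS the hole is normally left unallocated, so the space used on disk is far smaller than the file’s nominal size.

```swift
import Foundation

// Create a file whose content is mostly void; the path is illustrative.
let url = URL(fileURLWithPath: "/tmp/sparse-demo.bin")

do {
    _ = FileManager.default.createFile(atPath: url.path, contents: nil)
    let handle = try FileHandle(forWritingTo: url)
    try handle.write(contentsOf: Data("start".utf8))
    try handle.seek(toOffset: 100_000_000)          // leave a 100 MB hole
    try handle.write(contentsOf: Data("end".utf8))
    try handle.close()

    // Compare the file's nominal size with the space actually allocated.
    let values = try url.resourceValues(forKeys: [.fileSizeKey,
                                                  .totalFileAllocatedSizeKey])
    print("Logical size: \(values.fileSize ?? 0) bytes")
    print("Allocated on disk: \(values.totalFileAllocatedSize ?? 0) bytes")
} catch {
    print("Sparse file demo failed: \(error)")
}
```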

Implementation

There’s no single implementation that provides complete resilience. File systems which incorporate CRC-based error detection and correction on all stored data provide no protection against lost sectors. RAID mirroring provides useful protection from data loss only when that loss occurs locally on one of the mirrored disks, for example with sector loss on one hard disk. Combining methods can give good resilience, but only at the cost of efficiency, complexity and price.

Providing good resilience for all files in storage is usually unnecessary, particularly when using more reliable storage media such as SSDs, which tend to work normally or fail completely. In those cases, a combination of integrity checks and multiple independent copies of important files is probably the most efficient approach in terms of the processing and storage required. This is already easily accomplished in a thorough backup scheme, where you should aim to have at least two local backups of all important files, and one off-site. Tagging them with a hash for integrity checking is quick and simple using utilities such as my free Fintch, Dintch and cintch. I’ve even been running a long-term test of this in iCloud Drive.
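By way of illustration only, and not the scheme used by Fintch, Dintch or cintch, the sketch below tags a file with its SHA-256 digest in a hypothetical extended attribute, then re-hashes it later to confirm integrity; the path and attribute name are placeholders.

```swift
import Foundation
import CryptoKit
import Darwin

// Illustrative only: store a file's SHA-256 digest in an extended
// attribute, then re-hash later to check integrity. The attribute name
// is hypothetical, not that used by Fintch, Dintch or cintch.
let path = "/path/to/important/file"       // placeholder path
let attrName = "org.example.sha256"        // hypothetical xattr name

func sha256Hex(ofFileAt path: String) throws -> String {
    let data = try Data(contentsOf: URL(fileURLWithPath: path))
    return SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

do {
    // Tag: store the current digest as an extended attribute.
    let hex = Data(try sha256Hex(ofFileAt: path).utf8)
    _ = hex.withUnsafeBytes { bytes in
        setxattr(path, attrName, bytes.baseAddress, bytes.count, 0, 0)
    }

    // Check: read the stored digest back and compare with a fresh one.
    var stored = Data(count: 64)            // a hex SHA-256 is 64 characters
    let length = stored.withUnsafeMutableBytes { bytes in
        getxattr(path, attrName, bytes.baseAddress, bytes.count, 0, 0)
    }
    if length == 64,
       String(data: stored, encoding: .utf8) == (try sha256Hex(ofFileAt: path)) {
        print("File matches its stored hash")
    } else {
        print("Hash missing or mismatched")
    }
} catch {
    print("Integrity check failed: \(error)")
}
```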

Rather than using computationally expensive error-correcting codes, the best answer for most Mac users may lie with safety in numbers.

Further reading

Stefan M Moser and Po-Ning Chen (2012) A Student’s Guide to Coding and Information Theory, Cambridge UP. ISBN 978 1 107 01583 8.
David JC MacKay (2003) Information Theory, Inference, and Learning Algorithms, Cambridge UP. ISBN 978 0 521 64298 9.