Last week, I explained in some detail the principles underlying the most popular methods of compressing files. This article attempts to do the same for methods that enable errors in files to be detected and corrected.

Detecting errors in a file is straightforward: all you have to do is calculate a checksum (such as CRC) or hash of the file contents, save that, and compare a new calculation with the original checksum/hash. Provided that the method used to calculate the checksum/hash isn’t subject to frequent collisions, in which more than one version of the file can have the same checksum/hash value, any difference demonstrates that the file has changed. Of course there are plenty of practical issues, such as where and how to store the checksum/hash, but the basic principle is robust and widely used.
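In Python, for example, that detection step can be sketched using the standard library, here with bytes in memory standing in for a file’s contents:

```python
import hashlib
import zlib

def digests(data: bytes):
    """Return the CRC-32 checksum and SHA-256 hash of some data."""
    return zlib.crc32(data), hashlib.sha256(data).hexdigest()

original = b"the file contents"
stored = digests(original)        # saved alongside the file

# later, recompute and compare with what was stored
assert digests(original) == stored                # unchanged: match
assert digests(b"the file c0ntents") != stored    # corrupted: mismatch
```

Any mismatch proves the contents have changed, but says nothing about where.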

What that doesn’t do is show where the error or corruption has occurred, or how to correct it, and those are the problems that prove more difficult to address. Simply storing two copies of the file is highly inefficient, and provides only limited help: if the copies differ and neither matches the checksum/hash, there’s no way to tell which to trust. Parity bits can be used in RAID arrays to support error-correction, but they too prove inefficient: to be able to correct errors in a mirrored pair of disks, the parity stripe or disk has to store one parity bit for every two bits of data. Total storage required for each file is therefore 2.5 times that of the original.
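The parity principle can be illustrated with a toy sketch: one parity block, computed as the XOR of two data blocks, allows either block to be rebuilt when we know which one was lost (an erasure). This shows the principle only, not how any real RAID controller is implemented:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

disk1 = b"\x0f\xaa\x3c"
disk2 = b"\xf0\x55\xc3"
parity = xor_bytes(disk1, disk2)    # stored on the parity disk

# if disk2 fails completely, its contents can be rebuilt
assert xor_bytes(disk1, parity) == disk2
```

Note that parity of this kind corrects erasures, where the location of the loss is known; by itself it can’t identify which bit has flipped.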

The first big breakthrough in error-correcting codes came in 1950, when Richard W Hamming published the code which now bears his name. To understand how this works, I’ll explain a simpler and highly inefficient ancestor.

One really simple way to detect (but not correct) errors is to repeat every bit. Provided the chance of error is low, we can be reasonably confident that any pair of bits which isn’t the same must be an error. However, to be able to correct that error we need a third copy of the bit. Then we know that the triplet {0 0 0} is correct, but {0 0 1} contains one error, and the chances are that it should be {0 0 0} instead.

This breaks down when the error rate gets high, as {0 0 1} could equally contain two errors; we’ll still be able to detect that as an error, but correcting it would then restore the wrong value.

This repetition scheme is itself a simple code, which works as follows:

- if we read any of the codes {0 0 0}, {1 0 0}, {0 1 0}, {0 0 1} then they’re most likely to represent the single bit 0
- if we read any of the codes {1 1 1}, {0 1 1}, {1 0 1}, {1 1 0} then they’re most likely to represent the single bit 1.
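The decoding rule in that list is just a majority vote over each triplet of bits, which might be sketched like this:

```python
def encode(bits):
    """(3, 1) repetition code: each bit becomes a triplet."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(coded):
    """Majority vote over each triplet of received bits."""
    return [1 if sum(coded[i:i+3]) >= 2 else 0
            for i in range(0, len(coded), 3)]

sent = encode([1, 0])    # [1, 1, 1, 0, 0, 0]
sent[1] = 0              # a single-bit error in the first triplet
assert decode(sent) == [1, 0]
```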

This code maps 1 bit of original data to 3 bits of code, so is known as a (3, 1) error-correcting code, with a code rate of 1/3. It’s practically useless, as applying it to any file triples its size. What Hamming accomplished, using more sophisticated maths, was a (7, 4) error-correcting code with a code rate of 4/7, a considerable improvement in efficiency. It works by taking four bits {s1 s2 s3 s4} and encoding them as a codeword of seven bits {p1 p2 p3 s1 s2 s3 s4}, in which p1, p2 and p3 are calculated as:

- p1 = s1 + s3 + s4
- p2 = s1 + s2 + s3
- p3 = s2 + s3 + s4

where bit addition doesn’t ‘carry’, so 1 + 1 = 0; this is addition modulo 2, the familiar XOR operation.
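Using those three parity equations, with XOR standing in for carry-free addition, encoding and single-error correction can be sketched as follows. The table lists, for each of the seven codeword positions, which of the three parity checks it takes part in; because every column is distinct, a single flipped bit produces the pattern of failed checks (the syndrome) of its own position, pinpointing it:

```python
def encode(s1, s2, s3, s4):
    """Encode four data bits as the codeword {p1 p2 p3 s1 s2 s3 s4}."""
    p1 = s1 ^ s3 ^ s4
    p2 = s1 ^ s2 ^ s3
    p3 = s2 ^ s3 ^ s4
    return [p1, p2, p3, s1, s2, s3, s4]

# which parity checks each codeword position participates in
COLUMNS = [(1,0,0), (0,1,0), (0,0,1),           # p1, p2, p3
           (1,1,0), (0,1,1), (1,1,1), (1,0,1)]  # s1, s2, s3, s4

def decode(r):
    """Correct up to one flipped bit, then return the four data bits."""
    syndrome = (r[0] ^ r[3] ^ r[5] ^ r[6],      # recheck p1
                r[1] ^ r[3] ^ r[4] ^ r[5],      # recheck p2
                r[2] ^ r[4] ^ r[5] ^ r[6])      # recheck p3
    if syndrome != (0, 0, 0):
        r[COLUMNS.index(syndrome)] ^= 1         # flip the offending bit
    return r[3:]

word = encode(1, 0, 1, 1)
word[4] ^= 1                                    # corrupt one bit (s2)
assert decode(word) == [1, 0, 1, 1]
```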

Hamming code can correct all single-bit errors. With one additional overall parity bit, the extended Hamming code can also detect, though not correct, two-bit errors. Wikipedia’s excellent account, complete with explanatory Venn diagrams, is here.

Ten years after Hamming code came Reed-Solomon (R-S) code, invented by Irving S Reed and Gustave Solomon. Nearly twenty years later, when Philips was developing the format for CDs, their code was adopted to correct errors in reading discs. Unlike the codes I have discussed so far, when used in CDs R-S codes are applied to bytes rather than bits, in two steps.

The first step encodes 24 B of input data into 28 B of code. Those codewords are then interleaved in blocks of 28, or 784 B of codewords, after which a second R-S coding converts each 28 B into 32 B of code. The overall code rate is thus 24/32, so an input file grows by a third following this double encoding. R-S code is explained in detail in this Wikipedia article.
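The arithmetic behind that growth figure can be checked exactly:

```python
from fractions import Fraction

stage1 = Fraction(28, 24)    # first R-S encoding: 24 B -> 28 B
stage2 = Fraction(32, 28)    # second R-S encoding: 28 B -> 32 B
growth = stage1 * stage2     # overall expansion of the input

assert growth == Fraction(4, 3)        # code rate 24/32 = 3/4
assert growth - 1 == Fraction(1, 3)    # the file grows by a third
```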

The reason for such a complex scheme of error-correction in CDs is to correct bursts of errors up to 4 kilobits, or 500 bytes, long, representing about 2.5 mm of the track on a CD. Similar methods have been used for DVDs, and in Parchive files distributed in Usenet posts. However, it becomes progressively harder and less efficient to provide error-correction for larger blocks of missing data, which is of course one of the most serious problems in computer storage systems.

As many hard disks now have sector sizes of 4 KB, the method used for CDs offers no hope of recovering from the loss of even a single sector of data. However, combining a scheme like R-S with a robust backup strategy should give high resilience, even though it’s not particularly efficient in its use of storage space.