If you want to protect important documents from corruption or loss, it’s not enough just to know whether any particular copy of a file has been corrupted. You also need some means of recovering or repairing the file so that your master copy remains intact and fully usable.
One simple way to do this is to store two complete copies, something commonly done using RAID mirroring. When it works perfectly, that’s great, but it’s inefficient in terms of storage: you need twice the space to store each file. When files can be 100 GB videos, that quickly gets seriously expensive.
There’s another snag with a simple mirror: unless you monitor each copy of the file to ensure it remains intact, you could end up with both copies being corrupted. Although this might seem unlikely, most RAID mirrors are made up of hard disks. Once a disk is more than about three years old, the chance of it failing in the next year could be as high as 20%. If the two disks failed independently, the chance of both failing would be only 4% (20% of 20%), but in practice hard disks in RAID systems are usually from the same batch, and batch members are likely to fail at around the same time, so the chance of both disks failing in the fourth year could be 10% or greater.
Another recognised problem is that RAID mirrors, when they work properly, write exactly the same data to each of the disks. If the corruption is introduced at or before that stage, rather than on one of the disks, then you will have two identically corrupt copies of the file.
One slightly smarter way to use copies of a file is to calculate checksums or digests not just for each copy as a whole, but also for blocks within it. Then, if only the first block of a file has become corrupted, you don’t need to find a completely intact copy, just one whose first block is intact, identified by matching its block checksum. However, that still doesn’t get around the requirement for at least two copies of each file.
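As a rough sketch of how block-level checking and repair might work, assuming two copies of the same file and reference digests saved while the file was known to be good, the Python below records a SHA-256 digest for each fixed-size block, then replaces any block of the master that no longer matches with the corresponding block from the second copy, provided that block still checks out. The file names and block size are purely illustrative.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block; an arbitrary choice for illustration

def block_digests(path):
    """Return a SHA-256 digest for each fixed-size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            digests.append(hashlib.sha256(block).hexdigest())
    return digests

def repair(master, copy, reference):
    """Rebuild any corrupt block of 'master' from 'copy', using 'reference',
    a list of block digests recorded when the file was known to be intact."""
    with open(master, "r+b") as m, open(copy, "rb") as c:
        for i, good in enumerate(reference):
            m.seek(i * BLOCK_SIZE)
            if hashlib.sha256(m.read(BLOCK_SIZE)).hexdigest() == good:
                continue  # this block of the master is still intact
            c.seek(i * BLOCK_SIZE)
            candidate = c.read(BLOCK_SIZE)
            if hashlib.sha256(candidate).hexdigest() == good:
                m.seek(i * BLOCK_SIZE)
                m.write(candidate)  # replace the corrupt block with the intact one

# Hypothetical usage: digests are saved while the file is known to be good,
# then used later to repair a damaged master from a backup copy.
reference = block_digests("master.mov")
repair("master.mov", "backup.mov", reference)
```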
The solution to this efficiency problem comes from coding theory, which covers data compression, encryption, error correction, and more. It encodes files in a way that enables errors and corruption to be repaired using storage more efficiently than keeping duplicates. Much of this developed from work on recovering data from noisy radio signals, so the introductory examples here apply to signal coding and to stored files with varying degrees of relevance.
In transmitted data, a similar approach to mirroring files is to send each bit in the data twice. To send a zero bit, you thus send two zeroes. If the receiver then gets two zeroes, it knows they should represent a zero bit. What, then, if the receiver gets one of each, either 0 1 or 1 0? That is clearly an error, but the original bit could have been either 0 or 1.
So a twice-repetition code can detect a single-bit error, but can’t correct it. If both bits are flipped, it won’t even detect the error.
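As a minimal sketch, a decoder for this twice-repetition code simply checks whether the two received bits agree; if they don’t, it can flag an error but has no way of resolving it:

```python
def decode_pair(bits):
    """Decode one bit sent as two repeated bits: return the bit, or None if an error is detected."""
    a, b = bits
    return a if a == b else None  # disagreement: error detected, but it can't be corrected

print(decode_pair((0, 0)))  # 0
print(decode_pair((0, 1)))  # None: the original could have been either 0 or 1
```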
The next step in what are known as linear codes is to try repeating each bit three times. So to send the zero bit, you send three zeroes. If you look at the table below, you’ll see how this is more useful.
A three-times repetition code will not only detect but also correct all single-bit errors, by taking a majority vote: 0 0 1 is decoded to 0, for instance. It does this by mapping 1 bit into 3 bits, conventionally expressed by calling it a (3, 1) code with a ‘code rate’ of 1/3. In storage terms, each 1 bit of data requires 3 bits of storage, which is even less efficient than mirrored disks.
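Sketched in the same way, a three-times repetition decoder takes a majority vote over the received bits, which corrects any single flipped bit:

```python
def decode_triple(bits):
    """Decode one bit sent as three repeated bits by majority vote, correcting any single-bit error."""
    return 1 if sum(bits) >= 2 else 0

print(decode_triple((0, 0, 1)))  # 0: the single flipped bit is corrected
print(decode_triple((1, 1, 0)))  # 1
```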
To start seeing where error-correcting codes (ECC) perform significantly better, we move on to a slightly more complex scheme, the Hamming code. For this, messages or files are divided into groups of 4 bits, to which 3 bits are added to form the code. These are shown in the table below.
To send or store the four bits 0 1 1 1, we code those with three initial bits 0 0 1, making the complete code 0 0 1 0 1 1 1. In files, those three added bits are often known as parity data, and they enable the code to correct all single-bit errors and to detect all two-bit errors.
For example, if the code 0 1 1 1 1 0 0 is received, that doesn’t match any of the fully-correct codes. The only single-bit change that could account for it is the code 0 1 1 0 1 0 0, which is decoded to 0 1 0 0. The received code could also have resulted from errors in two bits, although in that case there’s more than one possible decoding. So as well as correcting all single-bit errors, the Hamming code detects (but can’t correct) two-bit errors, although it can’t tell them apart from single-bit errors in a different codeword. It’s a (7, 4) code with a code rate of 4/7, and in storage terms each 4 bits of data require 7 bits of storage. That’s more efficient than mirrored disks, although it can’t correct as many errors.
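To make this concrete, here’s a minimal Python sketch of a (7, 4) Hamming encoder and decoder. The parity equations below are an assumption: they’re one arrangement that reproduces the codes quoted above, and the table may use a different but equivalent layout.

```python
def encode(d):
    """Encode 4 data bits (d1, d2, d3, d4) as a 7-bit code (p1, p2, p3, d1, d2, d3, d4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d3 ^ d4
    p2 = d1 ^ d2 ^ d3
    p3 = d2 ^ d3 ^ d4
    return (p1, p2, p3, d1, d2, d3, d4)

def decode(c):
    """Correct any single-bit error in a 7-bit code and return the 4 data bits."""
    p1, p2, p3, d1, d2, d3, d4 = c
    # Recompute each parity check; failed checks form the 'syndrome'.
    s1 = p1 ^ d1 ^ d3 ^ d4
    s2 = p2 ^ d1 ^ d2 ^ d3
    s3 = p3 ^ d2 ^ d3 ^ d4
    # Each bit position takes part in a unique combination of checks,
    # so a non-zero syndrome identifies exactly which bit to flip.
    position = {(1, 0, 0): 0, (0, 1, 0): 1, (0, 0, 1): 2, (1, 1, 0): 3,
                (0, 1, 1): 4, (1, 1, 1): 5, (1, 0, 1): 6}
    bits = list(c)
    if (s1, s2, s3) != (0, 0, 0):
        bits[position[(s1, s2, s3)]] ^= 1  # flip the offending bit
    return tuple(bits[3:])                 # return just the data bits

print(encode((0, 1, 1, 1)))           # (0, 0, 1, 0, 1, 1, 1), as in the example above
print(decode((0, 1, 1, 1, 1, 0, 0)))  # (0, 1, 0, 0): the single-bit error is corrected
```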
Richard Wesley Hamming (1915-1998) published his code in 1950. Ten years later, Irving S. Reed and Gustave Solomon developed what’s now known in their honour as the Reed-Solomon code. These codes are still used in audio CDs, and can squeeze good levels of error correction for 28 bytes of raw data into just 32 bytes of storage.
Unfortunately, the maths of even basic Reed-Solomon codes is extremely complex and involves juggling polynomials, and to explain them fully invokes Galois fields, which would be a terrifying prospect for any of us.
Just as with an audio CD, a Mac can take the original file, divide it into a series of blocks, then encode the data into a combination of ‘parity’ and original data. These can be stored together or, as in the Parchive format, in separate recovery blocks or parity files. What’s more, the data in those files can be structured so that they too can withstand modest levels of corruption.
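Parchive itself relies on Reed-Solomon coding, but the idea of separately stored recovery data can be sketched with something far simpler: a single recovery block formed by XORing equal-sized data blocks together, enough to rebuild any one block identified as corrupt (by its checksum failing, say). This is an illustration of the principle only, not of the Parchive format.

```python
def make_parity(blocks):
    """Compute one recovery block as the XOR of all data blocks (assumed equal in length)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def rebuild(blocks, parity, lost):
    """Reconstruct the block at index 'lost' from the surviving blocks and the recovery block."""
    rebuilt = bytearray(parity)
    for j, block in enumerate(blocks):
        if j == lost:
            continue
        for i, byte in enumerate(block):
            rebuilt[i] ^= byte
    return bytes(rebuilt)

# Hypothetical example: three 4-byte blocks from a file, plus one recovery block.
blocks = [b"\x10\x20\x30\x40", b"\x0a\x0b\x0c\x0d", b"\xff\x00\xff\x00"]
parity = make_parity(blocks)
print(rebuild(blocks, parity, 1) == blocks[1])  # True: the damaged block is recovered
```

Reed-Solomon recovery data generalises this, allowing several damaged blocks to be rebuilt provided enough recovery blocks survive.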
There are theoretical and practical limits to what can be achieved with ECC. Obviously, a file whose contents have been completely replaced with zeroes or random bytes can’t be recovered from ‘parity’ data, whereas an intact copy of a file can replace a lost master copy. Depending on how well the chosen ECC performs, you can then decide on the balance to strike between ECC and complete copies: they are complementary techniques which work best to guard against different types of loss or damage.