Last week, I explained in some detail the principles underlying the most popular methods of compressing files. This article attempts to do the same for methods that enable errors in files to be detected and corrected.

Detecting errors in a file is straightforward: all you have to do is calculate a checksum (such as CRC) or hash of the file contents, save that, and compare a new calculation with the original checksum/hash. Provided that the method used to calculate the checksum/hash isn’t subject to frequent collisions, in which more than one version of the file can have the same checksum/hash value, any difference demonstrates that the file has changed. Of course there are plenty of practical issues, such as where and how to store the checksum/hash, but the basic principle is robust and widely used.
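In Python, for example, that detection step can be sketched using the standard library, here with bytes in memory standing in for a file’s contents:

```python
import hashlib
import zlib

def digests(data: bytes):
    """Return the CRC-32 checksum and SHA-256 hash of some data."""
    return zlib.crc32(data), hashlib.sha256(data).hexdigest()

original = b"the file contents"
stored = digests(original)        # saved alongside the file

# later, recompute and compare with what was stored
assert digests(original) == stored                # unchanged: match
assert digests(b"the file c0ntents") != stored    # corrupted: mismatch
```

Any mismatch proves the contents have changed, but says nothing about where.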

What that doesn’t do is show where the error or corruption has occurred, or how to correct it, and those are the problems that prove more difficult to address. Simply storing two copies of the file is highly inefficient, and provides only limited help: if the copies differ and neither matches the checksum/hash, there’s no way to tell which to trust. Parity bits can be used in RAID arrays to support error-correction, but they too prove inefficient: to be able to correct errors in a mirrored pair of disks, the parity stripe or disk has to store one parity bit for every two bits of data. Total storage required for each file is therefore 2.5 times that of the original.
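The parity principle can be illustrated with a toy sketch: one parity block, computed as the XOR of two data blocks, allows either block to be rebuilt when we know which one was lost (an erasure). This shows the principle only, not how any real RAID controller is implemented:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

disk1 = b"\x0f\xaa\x3c"
disk2 = b"\xf0\x55\xc3"
parity = xor_bytes(disk1, disk2)    # stored on the parity disk

# if disk2 fails completely, its contents can be rebuilt
assert xor_bytes(disk1, parity) == disk2
```

Note that parity of this kind corrects erasures, where the location of the loss is known; by itself it can’t identify which bit has flipped.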

The first big breakthrough in error-correcting codes came in 1950, when Richard W Hamming published the code which now bears his name. To understand how this works, I’ll explain a simpler and highly inefficient ancestor.

One really simple way to detect (but not correct) errors is to repeat every bit. Provided the chance of error is low, we can be reasonably confident that any pair of bits which isn’t the same must be an error. However, to be able to correct that error we need a third copy of the bit. Then we know that the triplet {0 0 0} is correct, but {0 0 1} contains one error, and the chances are that it should be {0 0 0} instead.

This breaks down when the error rate gets high, as {0 0 1} could equally contain two errors; we’ll still be able to detect that as an error, but correcting it would then restore the wrong value.

This repetition scheme is itself a simple code, which works as follows:

- if we read any of the codes {0 0 0}, {1 0 0}, {0 1 0}, {0 0 1} then they’re most likely to represent the single bit 0
- if we read any of the codes {1 1 1}, {0 1 1}, {1 0 1}, {1 1 0} then they’re most likely to represent the single bit 1.
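The decoding rule in that list is just a majority vote over each triplet of bits, which might be sketched like this:

```python
def encode(bits):
    """(3, 1) repetition code: each bit becomes a triplet."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(coded):
    """Majority vote over each triplet of received bits."""
    return [1 if sum(coded[i:i+3]) >= 2 else 0
            for i in range(0, len(coded), 3)]

sent = encode([1, 0])    # [1, 1, 1, 0, 0, 0]
sent[1] = 0              # a single-bit error in the first triplet
assert decode(sent) == [1, 0]
```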

This code maps 1 bit of original data to 3 bits of code, so is known as a (3, 1) error-correcting code, with a code rate of 1/3. It’s practically useless, as applying it to any file triples its size. What Hamming accomplished, using more sophisticated maths, was a (7, 4) error-correcting code with a code rate of 4/7, a considerable improvement in efficiency. It works by taking four bits {s1 s2 s3 s4} and encoding them as a codeword of seven bits {p1 p2 p3 s1 s2 s3 s4}, in which p1, p2 and p3 are calculated as:

- p1 = s1 + s3 + s4
- p2 = s1 + s2 + s3
- p3 = s2 + s3 + s4

where bit addition doesn’t ‘carry’, so 1 + 1 = 0; this is addition modulo 2, the familiar XOR operation.
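Using those three parity equations, with XOR standing in for carry-free addition, encoding and single-error correction can be sketched as follows. The table lists, for each of the seven codeword positions, which of the three parity checks it takes part in; because every column is distinct, a single flipped bit produces the pattern of failed checks (the syndrome) of its own position, pinpointing it:

```python
def encode(s1, s2, s3, s4):
    """Encode four data bits as the codeword {p1 p2 p3 s1 s2 s3 s4}."""
    p1 = s1 ^ s3 ^ s4
    p2 = s1 ^ s2 ^ s3
    p3 = s2 ^ s3 ^ s4
    return [p1, p2, p3, s1, s2, s3, s4]

# which parity checks each codeword position participates in
COLUMNS = [(1,0,0), (0,1,0), (0,0,1),           # p1, p2, p3
           (1,1,0), (0,1,1), (1,1,1), (1,0,1)]  # s1, s2, s3, s4

def decode(r):
    """Correct up to one flipped bit, then return the four data bits."""
    syndrome = (r[0] ^ r[3] ^ r[5] ^ r[6],      # recheck p1
                r[1] ^ r[3] ^ r[4] ^ r[5],      # recheck p2
                r[2] ^ r[4] ^ r[5] ^ r[6])      # recheck p3
    if syndrome != (0, 0, 0):
        r[COLUMNS.index(syndrome)] ^= 1         # flip the offending bit
    return r[3:]

word = encode(1, 0, 1, 1)
word[4] ^= 1                                    # corrupt one bit (s2)
assert decode(word) == [1, 0, 1, 1]
```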

Hamming code can correct all single-bit errors. With one additional overall parity bit, the extended Hamming code can also detect, though not correct, two-bit errors. Wikipedia’s excellent account, complete with explanatory Venn diagrams, is here.

Ten years after Hamming code came Reed-Solomon (R-S) code, invented by Irving S Reed and Gustave Solomon. Nearly twenty years later, when Philips was developing the format for CDs, their code was adopted to correct errors in reading discs. Unlike the codes I have discussed so far, when used in CDs R-S codes are applied to bytes rather than bits, in two steps.

The first step encodes 24 B of input data into 28 B of code. Those codewords are then interleaved in blocks of 28, or 784 B of codewords, after which a second R-S coding converts each 28 B into 32 B of code. The overall code rate is thus 24/32, so an input file grows by a third following this double encoding. R-S code is explained in detail in this Wikipedia article.
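The arithmetic behind that growth figure can be checked exactly:

```python
from fractions import Fraction

stage1 = Fraction(28, 24)    # first R-S encoding: 24 B -> 28 B
stage2 = Fraction(32, 28)    # second R-S encoding: 28 B -> 32 B
growth = stage1 * stage2     # overall expansion of the input

assert growth == Fraction(4, 3)        # code rate 24/32 = 3/4
assert growth - 1 == Fraction(1, 3)    # the file grows by a third
```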

The reason for such a complex scheme of error-correction in CDs is to correct bursts of errors up to 4 kilobits, or 500 bytes, long, representing about 2.5 mm of the track on a CD. Similar methods have been used for DVDs, and in Parchive files distributed in Usenet posts. However, it becomes progressively harder and less efficient to provide error-correction for larger blocks of missing data, which is of course one of the most serious problems in computer storage systems.

As many hard disks now have sector sizes of 4 KB, the method used for CDs offers no hope of recovering from the loss of even a single sector of data. However, combining a scheme like R-S with a robust backup strategy should give high resilience, even though it’s not particularly efficient in its use of storage space.