Should we take bit rot seriously?

We all know that, over time, the documents and other data we store become gradually corrupted by ‘bit rot’. In its broadest sense, this is data degradation resulting from imperfections in storage media. But over what period is bit rot a significant risk? Are we talking months, years or centuries?

Bit rot, and the measures intended to protect against it, is among the most controversial topics in computing. Most accounts are full of sweeping assertions and technical descriptions of how bit rot might happen, but the only figures they quote are drawn from manufacturers’ specifications, or from old studies of hard disks that are almost certainly irrelevant to modern storage media, even to modern hard disks.

If you can’t measure it, does it even exist?

When researching this article, I looked for contemporary measurements of the rate of bit rot on current storage media, but was unable to find a robust scientific study that might yield such figures. That isn’t to say it hasn’t been done, but if it has, someone’s keeping very quiet about the results. This is in spite of three modern file systems – notably ZFS, Btrfs and ReFS – all incorporating methods designed to detect bit rot.
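The principle those file systems rely on can be sketched in a few lines of Python. This assumes a simplified model in which each block’s checksum is stored apart from its data, loosely in the way ZFS keeps checksums in block pointers; it isn’t a representation of any real on-disk format.

import hashlib

def write_block(data):
    # Keep the checksum separate from the data it covers, so rot in the
    # data can't silently corrupt the checksum as well.
    return data, hashlib.sha256(data).digest()

def read_block(data, stored_digest):
    # Verify on every read: any flipped bit changes the digest.
    if hashlib.sha256(data).digest() != stored_digest:
        raise IOError("checksum mismatch: possible bit rot in this block")
    return data

data, digest = write_block(b"important archive record")
assert read_block(data, digest) == data          # intact data passes
# read_block(b"important archive recorD", digest) would raise IOError

Note that this only detects the damage; on its own it can’t tell you what the block should have contained.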

Detecting bit rot in your sole copy of an important file isn’t particularly helpful either. Much better are error-correcting codes (ECC), which in a self-healing file system can repair errors as they occur, provided they fall within the code’s capacity to correct. Even that is of little help if the medium is write-once, as most archival storage is likely to be.
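To illustrate how a code can repair as well as detect, here’s a sketch of the classic Hamming(7,4) code in Python, which locates and fixes any single flipped bit in a 7-bit codeword. Real storage systems use far stronger codes; nothing here reflects any particular implementation.

def hamming74_encode(d):
    # d is four data bits [d1, d2, d3, d4]; returns a 7-bit codeword.
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    # Returns (repaired codeword, position of the flipped bit, or 0 if none).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recheck positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # recheck positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # recheck positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 means clean; else the bad position
    if syndrome:
        c = c.copy()
        c[syndrome - 1] ^= 1         # flip the corrupted bit back
    return c, syndrome

codeword = hamming74_encode([1, 0, 1, 1])
damaged = codeword.copy()
damaged[4] ^= 1                      # simulate one bit of rot
repaired, pos = hamming74_correct(damaged)
assert repaired == codeword and pos == 5

The cost is the overhead: here, three extra bits for every four of data, which is why such protection is never free.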

The storage medium that should benefit most from a file system designed to detect bit rot and to self-correct using ECC is the hard disk. Few hard disks survive three years without developing some errors, so even if you replace them at that point, some repairs will probably have been needed, or some data lost. Hard disks are also among the poorer-performing media in widespread use, and the overhead of ECC and other safety measures could only worsen that.

SSDs are a different matter: without much better information on the risks and rates of bit rot in them, it seems hard to justify imposing that overhead on their performance just in case it might prove significant.

A better strategy for SSDs and write-once media such as archival Blu-ray discs is surely to monitor file checksums periodically, falling back to a known-good copy of any file whose checksum has changed. That in turn requires a simple GUI method of scanning a volume and comparing each file’s current checksum with its expected value. I’m thinking about that a great deal just now.
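As a sketch of that strategy, the following Python records a manifest of SHA-256 checksums for a volume, then on later runs reports any file whose checksum no longer matches, so it can be restored from a good copy. This is only a command-line outline of the comparison, not the GUI tool in mind, and the manifest name and layout are illustrative.

import hashlib, json, sys
from pathlib import Path

MANIFEST = "checksums.json"   # hypothetical manifest kept at the volume root

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def record(root):
    # First pass: compute and store a checksum for every file.
    manifest = {str(p.relative_to(root)): sha256_of(p)
                for p in root.rglob("*")
                if p.is_file() and p.name != MANIFEST}
    (root / MANIFEST).write_text(json.dumps(manifest, indent=2))

def verify(root):
    # Later passes: flag any file whose checksum has changed or gone missing.
    manifest = json.loads((root / MANIFEST).read_text())
    for rel, expected in manifest.items():
        path = root / rel
        if not path.exists():
            print(f"MISSING  {rel}")
        elif sha256_of(path) != expected:
            print(f"CHANGED  {rel}  - restore from a known-good copy")

if __name__ == "__main__":
    action, root = sys.argv[1], Path(sys.argv[2])
    record(root) if action == "record" else verify(root)

Run once with ‘record’ after writing the archive, then with ‘verify’ on each periodic check; any file reported as changed is a candidate for restoration from another copy.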

Further reading

Wikipedia: a brief article explaining what can happen
Jody Bruchon’s article arguing that most measures against bit rot are a waste of effort
ProStorage, claiming the exact opposite.