Beyond Time Machine: 5 Archiving

Most of us have birth certificates and other documents, perhaps even from our ancestors, which go back fifty years or more. Yet it’s extremely unusual to find computer files which were created before 1990, apart from those with ‘null’ creation dates of 1 January 1970. We tend to think of electronic records as belonging to the present and recent past. But where are you going to store them ‘in a safe place’ for the future?

Archiving is about moving old documents and records from current storage to somewhere more permanent, which frees up active disks and ensures long-term access to those files in ten, fifty or even five hundred years. It’s one of the most difficult problems in computing, and one which most of us have yet to tackle seriously.

Storage medium

Traditional historical archives consist almost entirely of documents printed and written in ink on paper. There’s simply no equivalent electronic storage medium which can offer both high density and proven permanence. Currently the best bet is some form of optical disk, either DVD or Blu-ray. Although DVDs have been very widely used, their capacity is generally too low for modern use. Blu-ray can readily cope with 50 or 100 GB on each disk, but those disks aren’t cheap: ‘archival quality’ BD-R disks typically cost around $/€/£ 5-10 each.

You should also be extremely sceptical about claims of their longevity. Some manufacturers quote one hundred or one thousand years ‘archival life’, which assumes that, over the long term, the disks age as expected from accelerated ageing tests, and that they remain stored in optimal conditions for the whole of that period.

Matching the writer drive specification with media isn’t always easy. M-DISC has become a popular choice among those seeking greater longevity, and has both theory and manufacturers’ claims to support that, but many Blu-ray writers still don’t support burning M-DISCs, and you’ll need to choose yours with care. No Mac has ever come with a built-in Blu-ray writer, and none now supports internal optical drives at all, so you’ll need an external model, of which there’s a wide choice available.

To do the job thoroughly, you should burn two copies of everything you intend to archive and verify them. If either fails the verification step, discard that disk and burn another. Opinions are mixed as to the optimal burn speed: some advocate using the slowest possible, down to 1x, but that makes each burn much longer, and you’re probably best off using the rated speed of the disks. My experience with optical drives is that there’s a sweet spot for each writer and type of disk which is the best compromise between performance and reliability.
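If your burning software doesn’t offer its own verification, one way to check a freshly burned disk is to compare a checksum of every source file with that of its copy on the mounted disk. What follows is only a minimal sketch in Python: the function names are mine, and it assumes that the disk mounts as an ordinary folder with the same layout as the source folder.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_root: Path, copy_root: Path) -> list[str]:
    """Return the files which are missing from, or differ in, the copy."""
    problems = []
    for source_file in sorted(p for p in source_root.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_root)
        copied_file = copy_root / relative
        if not copied_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(copied_file):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

Calling verify_copy() with the source folder and the mounted disk, for instance the hypothetical paths Path("/Users/me/ToArchive") and Path("/Volumes/ARCHIVE_01"), returns an empty list when every source file reads back identically. Anything else is a reason to discard that disk.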

There are also some specialist products for the archiving market such as Sony’s Optical Disc Archive, offering write-once storage of up to 5.5 TB in an eleven-disk cartridge, and the Blu-ray-based Archival Disc, which may reach 1 TB on a single disk. These are significantly more expensive, have much lower adoption rates and are dependent on hardware and software which may not prove as enduring as plain Blu-ray.

File formats

Proprietary and binary file formats are notoriously short-lived, but invaluable when you want to access the contents of complex documents. The best compromise in an archive is to keep at least two different versions of each document, one in its original format, or exported into a format which can be accessed by other apps too, and one in a more enduring format even if that limits access to its components.

Standards which are likely to remain reliable for at least our lifetimes include:

  • ASCII and UTF-8 for text files,
  • JPEG and PNG for still images,
  • XML-based open document standards,
  • CSV for data,
  • PDF provided that it complies with one of the archival standards PDF/A-1 to /A-3.

Video, audio and rich media should use widely-used compressors which are likely to remain available in the future.

Beware when trying to store very large numbers of files on a single Blu-ray disk. Tens of thousands of files should write in an acceptable period, but many more than a hundred thousand may not be practical. Consider tarring (using tar, pax or cpio) smaller files together if necessary. Avoid compressing those archives in any way, as compression can amplify the effects of bit rot, making it impossible to extract even plain text files.
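As an illustration of bundling small files without compression, here is a short Python sketch using the standard tarfile module, whose plain "w" mode writes an uncompressed tar. The function name is my own, and it assumes that the tar file is written somewhere outside the folder being bundled.

```python
import tarfile
from pathlib import Path


def bundle_uncompressed(source_dir: Path, tar_path: Path) -> int:
    """Pack every file under source_dir into a plain tar; return the file count."""
    # List the files first, so the new tar file can never end up listing itself
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    # Mode "w" writes a plain tar; "w:gz", "w:bz2" and "w:xz" would compress it
    with tarfile.open(str(tar_path), mode="w") as archive:
        for path in files:
            # Store paths relative to source_dir rather than absolute ones
            archive.add(str(path), arcname=path.relative_to(source_dir).as_posix())
    return len(files)
```

Because nothing is compressed, the bytes of each file are stored unaltered inside the tar, so damage to one part of the archive shouldn’t prevent the rest from being recovered.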

Indexing and access

Looking through archives only a few months later can be a salutary experience. Even when an archive is structured rigorously, finding individual documents in it can be extremely frustrating. Each archived disk needs a full and structured list of contents stored on it in an accessible format such as UTF-8 text, and a printed summary should be stored with the disk to save you from having to mount disks in turn to look for documents on them.
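A list of contents needn’t be elaborate. The Python sketch below walks a folder which is about to be burned and writes one tab-separated line per file as UTF-8 text. The choice of columns (size, date modified and path) and the function names are assumptions of mine rather than any standard.

```python
from datetime import datetime, timezone
from pathlib import Path


def contents_lines(root: Path) -> list[str]:
    """Return a header plus one tab-separated line per file under root."""
    lines = ["size_bytes\tmodified_utc\tpath"]
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        info = path.stat()
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        relative = path.relative_to(root).as_posix()
        lines.append(f"{info.st_size}\t{modified:%Y-%m-%d}\t{relative}")
    return lines


def write_contents_list(root: Path, output: Path) -> None:
    """Save the listing as UTF-8 text, ready to burn alongside the files."""
    output.write_text("\n".join(contents_lines(root)) + "\n", encoding="utf-8")
```

The resulting text file can be burned to the disk with the documents, and printed out as the paper summary to keep with it.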

More sophisticated archives make extensive use of metadata and document retrieval systems which have been developed for digital libraries and the like. Although several, such as Greenstone, are free, they are not intended for casual use.

Physical storage

Archive optical disks should be stored in cases with centre hub security, not in sleeves. They must be kept in a cool, dry and dark container, in which there is no mould or fungus. They also need to be protected from physical threats such as flood and fire. Firesafes are a popular way of achieving this, but you must then ensure that their combination or keys are readily available and not separated from the firesafe.

Don’t print on the disk itself, and keep paper records alongside the disks in the same container, but not inside the cases themselves.

Checking

If you’re serious about maintaining your archive, you should check each of its disks once a year. This also gives you a chance to review whether the rate of bit rot is sufficient to make it wise to move the archives to new media.
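An annual check is easier if you saved checksums when the disk was burned. The Python sketch below assumes a text file with one line per file, consisting of a SHA-256 digest, two spaces and then the path relative to the disk, which is the layout written by shasum -a 256 in its default mode. The function names and the three categories in the report are my own.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(1 << 20):
            digest.update(chunk)
    return digest.hexdigest()


def check_disk(mount_point: Path, checksum_list: Path) -> dict[str, list[str]]:
    """Sort each listed file into ok, changed or unreadable."""
    report = {"ok": [], "changed": [], "unreadable": []}
    for line in checksum_list.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Split only once, so paths containing spaces stay intact
        expected, relative = line.split(maxsplit=1)
        try:
            actual = sha256_of(mount_point / relative)
        except OSError:
            # Missing files and read errors both end up here
            report["unreadable"].append(relative)
            continue
        key = "ok" if actual == expected.lower() else "changed"
        report[key].append(relative)
    return report
```

The proportion of files reported as changed or unreadable gives a crude measure of how far a disk has deteriorated since it was burned, and comparing that figure from year to year suggests when it’s time to move to new media.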

This may all seem a great deal of work. However, it pays off when you access anything in your archive. If you can’t do so, then it has all been a waste of time and money.

Further reading

Wikipedia point of entry
British Library digital preservation site.