Last week, I explained compression methods in general terms, carefully avoiding explaining how they work. Although that quickly gets very technical and of limited value, this article explains the basis for the most popular non-lossy compression schemes. This deeper understanding should help you appreciate their strengths and weaknesses, and help guide your choice between different methods.
To illustrate this, I’ll take a hypothetical data storage problem. I have 100,000,000 results from a survey of which version of macOS different users are currently running. Each contains one of four possible answers:
- macOS 12 Monterey
- macOS 11 Big Sur
- macOS 10.15 Catalina
- any previous version.
Clearly storing each result as the text entry is woefully inefficient, and my results file would contain something like 1.8 GB of ASCII text or UTF-8 Unicode. The logical approach is to use codes to represent each of those four possibilities. If I did that using 64-bit integers, the size of the file would fall to 800 MB, and skimping further by using single bytes gets it down to 100 MB.
Rather than using whole bytes, I can get even greater efficiency by using individual bits, which would squeeze it down to 3 bits per entry, that’s less than 38 MB in total, using a fixed length code. As you’d expect, numbers of those in each category vary widely, with macOS 12 outnumbering all the rest, and the smallest number running previous versions. I could design a variable length code which ensures a minimum average length by assigning the shortest code to the most popular category. That turns out to be:
- 0 = macOS 12
- 10 = macOS 11
- 110 = macOS 10.15
- 111 = other
These codes can be decoded uniquely using simple rules:
- Read the first bit. If it’s 0, then it represents macOS 12.
- If the first bit is 1, read the next bit. If that’s 0, then it represents macOS 11.
- If the second bit is 1, read the third bit. If that’s 0, then it represents macOS 10.15. If that’s 1, then it’s ‘other’.
- Repeat until all the bits have been decoded.
This turns out to be most efficient given the frequencies of responses, as the most popular response is coded with the fewest bits, and so on.
When we compress the file using this code, we need to:
- build a dictionary containing all possible values,
- work out their relative frequencies,
- assign codes to each, so that the most frequent are assigned the shortest code,
- convert the data into code.
In this case, the minimum compressed file size would be a little more than 12.5 MB, which is less than 1% of the size of the original text file, and a third of the shortest fixed length code requiring 3 bits per entry. The compressed file then consists simply of the code dictionary, followed by the 100,000,000 codes for the data.
What’s most challenging here is working out how to allocate codes to the data. If the algorithm were to get that wrong, the result would be no better than using fixed length codes. It was a PhD student at MIT named David A Huffman who discovered the encoding algorithm in 1951, hence this scheme is known as Huffman code.
The approach used here is also the basis of other compression algorithms. Abraham Lempel and Jacob Ziv published papers describing two algorithms named after them in 1977 and 1978. These also build a dictionary of codes, but instead of those codes being constructed for fixed amounts of uncompressed data, they use a sliding window so that each code can represent longer sequences of uncompressed data.
A popular variant of the Lempel-Ziv algorithm was published by Terry Welch in 1984, and is known as Lempel-Ziv-Welch or LZW. This is adapted to make the LZ method simpler so that it can be implemented in hardware for speed.
Apple released its favourite algorithm, known as Lempel-Ziv Finite State Entropy or LZFSE, as open source in 2015. This combines the Lempel-Ziv approach with an encoding method published by Jarosław (Jarek) Duda in 2013 instead of Huffman code. This is intended to strike the best compromise between the amount of compression and the speed of decompression, and is the recommended method in Apple Archive.
Codes are a wide and important field in computing, encompassing compression, error detection and correction, and other fundamental functions. Seven years ago, in this article, I showed how new and improved versions of Morse code could be developed using information and coding theory. Next week I’ll look at how codes can detect and correct errors.
Stefan M Moser and Po-Ning Chen (2012) A Student’s Guide to Coding and Information Theory, Cambridge UP. ISBN 978 1 107 01583 8.
David JC MacKay (2003) Information Theory, Inference, and Learning Algorithms, Cambridge UP. ISBN 978 0 521 64298 9.
Wikipedia has many excellent articles explaining individual lossless compression algorithms.