Explainer: Compression

The basic idea behind data compression is really simple. Look in one of the cupboards in your kitchen. If you’re neat and tidy and have space to spare, you might lay out the contents for ease of access. If you were paying rent for that space, you’d want to squeeze in as much as possible, so would go to some effort to pack it in efficiently. Storage space on your Mac doesn’t come free, and most of us like to minimise the size of downloads and email attachments. The answer then is to compress that data, to make best use of our limited storage or reduce the time to download.

Packing as much as possible into a cupboard has its downsides, just like compressing data. When you want something at the back, it takes longer, as you need to empty the cupboard to make room so you can find what you’re looking for. Accessing compressed data also brings overheads, and the more densely compressed your data, the longer it normally takes to decompress and recompress it.

There’s also a finite limit to how much you can pack into your cupboards without losing some of the contents, just as there is to how far data can be compressed. With some types of data, such as images, audio and movies, you can use lossy compression methods which reduce their quality to squeeze them down to even smaller sizes. You can’t do that with other data such as text, which would quickly be turned into gibberish.

Non-lossy

Non-lossy compression techniques are designed to perform as uniformly as possible across different types of uncompressed data, and are thus found in general-purpose compression tools. StuffIt, Zip, RAR, 7-Zip and others have all found favour at different times for different purposes.

Because decompression must result in perfect reconstitution of the original input data, there are limits to what can be achieved. For example, if a file consisted of a single 1 followed by 9999 zeros, run-length encoding could compress it very efficiently, coding it as <start of file><1 x 1><9999 x 0><end of file>. The best solution for a file like that isn’t compression at all, though, but an APFS sparse file, which simply doesn’t store the long run of zeros.
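
To make that concrete, here’s a minimal sketch of a naive run-length encoder in Swift. It’s purely illustrative, with made-up names, and isn’t the format used by any real compressor:

    import Foundation

    // A naive run-length encoder: collapses each run of repeated bytes
    // into a (count, value) pair. Purely illustrative, not a real format.
    func runLengthEncode(_ data: [UInt8]) -> [(count: Int, value: UInt8)] {
        var runs: [(count: Int, value: UInt8)] = []
        for byte in data {
            if let last = runs.last, last.value == byte {
                runs[runs.count - 1].count += 1
            } else {
                runs.append((count: 1, value: byte))
            }
        }
        return runs
    }

    // The file described above, a single 1 followed by 9999 zeros,
    // reduces to just two pairs.
    let file: [UInt8] = [1] + [UInt8](repeating: 0, count: 9999)
    print(runLengthEncode(file))  // [(count: 1, value: 1), (count: 9999, value: 0)]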

Applying non-lossy compression to a 24-bit deep 768 x 576 pixel raw image occupying 1.4 MB can squeeze it down to 1.1 MB using Zip compression, or 996 KB using 7-Zip, for a best compressed size of 71% of the original. The non-lossy LZW compression used within the same image saved as TIFF works as well as Zip, reaching 1.1 MB, or 79%. However, smarter use of non-lossy techniques can do far better: a folder containing 25 identical copies of that same image, 35 MB uncompressed, shrinks to 12.4 MB using Zip, and a remarkable 556 KB with 7-Zip, which is clearly optimising across as well as within files to achieve 1.6%.

One type of data that most non-lossy methods don’t handle well is data that’s already compressed, as it’s already approaching the mathematical limit of compression. File formats that are inherently wasteful of space, such as PDF and most XML-based formats, increasingly use compression as part of the format itself, so trying to compress them further is largely a waste of time and effort. The only real benefit then is being able to move several files together in a single archive.

If you’re obsessed with achieving the most efficient compressed file sizes, the only way to work out which compression method and app to use is to assemble test folders containing example files representative of what you want to compress, then try different methods, timing how long each takes and recording the archive sizes. One tool that makes it easy to compare different compression methods is Keka, from the App Store, which is a popular utility in its own right.

In general, there’s a trade-off between the time taken to compress and decompress, and the size of compressed archives. The most efficient methods normally also require the most computing power, so take the longest, particularly when compressing. For most methods, decompression is considerably quicker than compression.
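
You can get a feel for that trade-off with the compression built into Foundation (macOS 10.15 or later). This rough sketch times each of Apple’s four built-in algorithms on a single sample file; the path is just a placeholder for your own test file:

    import Foundation

    // Rough benchmark of Apple's built-in compression algorithms
    // on one sample file.
    let url = URL(fileURLWithPath: "/path/to/sample.tiff")
    guard let data = try? Data(contentsOf: url) else {
        fatalError("couldn't read the sample file")
    }
    let original = data as NSData

    let algorithms: [(String, NSData.CompressionAlgorithm)] =
        [("lz4", .lz4), ("zlib", .zlib), ("lzfse", .lzfse), ("lzma", .lzma)]

    for (name, algorithm) in algorithms {
        let start = Date()
        guard let compressed = try? original.compressed(using: algorithm) else { continue }
        let elapsed = Date().timeIntervalSince(start)
        let ratio = 100.0 * Double(compressed.length) / Double(original.length)
        print(String(format: "%@: %ld bytes (%.1f%%) in %.3f s",
                     name, compressed.length, ratio, elapsed))
    }

Typically lz4 is the quickest but produces the largest output, while lzma is the slowest and smallest, with zlib and lzfse in between; decompression can be timed the same way using decompressed(using:).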

If you’re looking for high performance compression, Apple Archive has been carefully tuned by Apple for use in its installers and updaters, and makes best use of hardware features of M1 series Macs. Although third-party support is currently limited, and confined to Big Sur and later, it’s one to watch.

Lossy

Lossy compression is inevitably tailored to the medium that is to be compressed.

Lossy audio compression builds on psycho-acoustic techniques in formats like MP3, developed by the Fraunhofer Institute from the 1980s onwards. Even the most discriminating human ear doesn’t perceive all the sounds present in audio tracks, so by simplifying the audio file, the amount of information that has to be encoded and compressed is reduced, yielding higher compression ratios.

Early implementations of MP3 were prone to audible distortion of percussive sounds, and still can sound sickly if Indonesian gamelan is encoded at low bit rates, for instance. Similar degradation of quality is readily seen in the pixellation of highly compressed JPEG images, and motion blurring in video compression. These illustrate the importance of control over lossy compressors, to determine the amount of degradation in the compressed output.

Lossy compression of still images is almost universal. Taking the same 1.4 MB raw image and saving it as a maximum quality JPEG reduces it to 596 KB, 43% of the original size, superior to all the generic non-lossy compressors but without discernible reduction in image quality. High quality JPEG compression takes the file down to 352 KB (25%), moderate to 224 KB (16%), and low to 152 KB (11%), with increasingly obvious compression artefacts.
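
If you want to reproduce that kind of comparison in code, ImageIO gives direct control over JPEG quality when writing an image. This is a minimal sketch, assuming you already have a CGImage in hand, and macOS 11 or later for UTType:

    import Foundation
    import ImageIO
    import UniformTypeIdentifiers

    // Write a CGImage as JPEG at a chosen quality, where 0.0 gives the
    // smallest file and 1.0 the best quality.
    func writeJPEG(_ image: CGImage, to url: URL, quality: Double) -> Bool {
        guard let destination = CGImageDestinationCreateWithURL(
            url as CFURL, UTType.jpeg.identifier as CFString, 1, nil) else {
            return false
        }
        let options = [kCGImageDestinationLossyCompressionQuality: quality] as CFDictionary
        CGImageDestinationAddImage(destination, image, options)
        return CGImageDestinationFinalize(destination)
    }

Writing the same image at a series of quality settings and comparing the resulting file sizes should give figures similar to those above.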

To give you an idea of how closely tailored lossy compression methods are to their media, this is a rough outline of what happens when compressing an image using good old JPEG:

  1. changing its colour space to Y’CʙCʀ;
  2. reducing the resolution of the Cʙ and Cʀ channels, as we perceive fine colour detail more poorly than brightness detail;
  3. splitting the image into 8 x 8 blocks for frequency-based analysis using the Discrete Cosine Transform (sketched in code after this list);
  4. reducing the high-frequency components in each block according to the user’s quality setting (0-100);
  5. lossless compression of block data using a form of Huffman encoding, in which the most common data is given the shortest encoding (as with text compression and Morse code).
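
To make steps 3 and 4 more concrete, here’s a naive sketch of the 2D Discrete Cosine Transform applied to a single 8 x 8 block, written for clarity rather than the speed a real JPEG codec needs:

    import Foundation

    // Naive 2D DCT-II of one 8 x 8 block of samples (JPEG first shifts
    // pixel values by -128 so that they centre on zero).
    func dct8x8(_ block: [[Double]]) -> [[Double]] {
        let n = 8
        var out = [[Double]](repeating: [Double](repeating: 0, count: n), count: n)
        for u in 0..<n {
            for v in 0..<n {
                var sum = 0.0
                for x in 0..<n {
                    for y in 0..<n {
                        sum += block[x][y]
                            * cos((2.0 * Double(x) + 1) * Double(u) * Double.pi / 16)
                            * cos((2.0 * Double(y) + 1) * Double(v) * Double.pi / 16)
                    }
                }
                let cu = u == 0 ? 1 / 2.0.squareRoot() : 1.0
                let cv = v == 0 ? 1 / 2.0.squareRoot() : 1.0
                out[u][v] = 0.25 * cu * cv * sum
            }
        }
        return out
    }

    // A uniform block has energy only in its DC coefficient out[0][0];
    // every other coefficient is zero, so it compresses extremely well.
    let flat = [[Double]](repeating: [Double](repeating: 100.0, count: 8), count: 8)
    print(dct8x8(flat)[0][0])  // 800.0; all other coefficients are ~0

Step 4, quantisation, then divides each coefficient by a value from a quality-dependent table and rounds the result, which is where the loss actually occurs. That’s also why the flat red image in the next paragraph compresses so dramatically.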

Using a standard 500 x 500 image consisting entirely of identical red pixels, JPEG will compress the image without loss from its uncompressed size of 1 MB (750 KB for just its three colour channels) to just 6 KB, no matter what the quality setting.

More recently, new methods such as those in the HEIF/HEIC formats have improved considerably on JPEG, although they’re not yet as widely supported, so can’t normally be used on websites and the like.

Applying a further non-lossy compression step to a JPEG image is usually of little benefit: the 224 KB moderately compressed image shrinks to 136 KB using Zip, an overall compression to 10% of the original, whilst 7-Zip performs even worse at 180 KB (13%). This illustrates the general rule that, once data has been well compressed, trying further compression techniques brings diminishing returns.

The ultimate challenge for lossy compression is video, which is usually accompanied by compressed audio tracks. For all but the lowest resolutions of video, lossy compression is required just to be able to move the data to and from storage devices. In the early days, proprietary methods such as DivX were popular, but in more recent times a succession of increasingly sophisticated MPEG standards has replaced them.

Because lossy compression loses a little quality from the original each time compression occurs, it’s important to minimise the number of times each file is compressed. Even at higher qualities, content will progressively degrade and appear tired and overcompressed. That’s why it’s best to work with uncompressed media or use lossless compression until you’re ready to deliver the final version, in the only step that involves lossy compression.

Don’t forget decompression

Tucked away in a cupboard upstairs I have my original Macintosh IIfx, with a third-party expansion card providing hardware support for compression and decompression. At the time it was a huge advance, reducing the time required to less than 10% of what could be achieved in software alone. The snag is that its compression format is proprietary, and without that card I’m unable to decompress any of the files it so swiftly squeezed.

The most important factor to consider when compressing files for archives, or to send them to someone else, is whether they can be decompressed. No matter how efficient or quick a method might be, if the recipient of your email doesn’t know how to decompress your attachments, or in 20 years’ time you need to open those archives, then you might as well have encrypted them and thrown the password away.