Understanding digital audio

Computers, particularly Macs, have revolutionised sound, music and audio. To achieve that, they have also transformed sound from an analogue world of instruments and sound waves into streams of digital data. This article helps you understand the fundamentals of digital audio.

Analogue sound, whether speech, music or ambient noise, consists of small pressure changes travelling as waves in the air. For the sake of example, I’ll start with the vibrating string of a musical instrument, whose sound is picked up by a microphone and transduced into an electric current that varies with the same wave pattern as the pressure changes in the air.

Before a computer can do anything with that electric current, it has to be converted from analogue form to digital, a stream of numbers. That conversion is performed by an analogue to digital converter, or ADC, which rapidly samples the input voltage.

Sound can range in frequency from almost nothing up to the ultrasound used in medical imaging, at several megahertz. Thankfully, unlike some animals, humans have quite a limited range of hearing, typically from around 20 to 20,000 Hertz (20 Hz to 20 kHz). Ignoring any audio processing we might want to apply to the sound of our string instrument, to reproduce its sound faithfully we need to reconstruct its analogue waveform for audio output so that it’s identical to the input.

To achieve that, we must get two choices right: the sampling or conversion rate when converting from analogue to digital, and an appropriate precision for each of those sampled values. Get either of those wrong and the output will be distorted when compared to the original.
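To make that concrete, here’s a minimal Swift sketch of the sampling step, simulating what an ADC does with a tone; the 440 Hz frequency, 48 kHz rate and 10 ms duration are purely illustrative values, not anything prescribed by real hardware.

import Foundation

// Illustrative sketch only: a real ADC samples an analogue voltage in hardware.
// Here we simulate sampling a 440 Hz sine wave at a rate of 48 kHz.
let sampleRate = 48_000.0       // samples per second
let frequency = 440.0           // Hz, the tone being 'recorded'
let duration = 0.01             // seconds of audio to capture

let sampleCount = Int(sampleRate * duration)
let samples: [Float] = (0..<sampleCount).map { n in
    let t = Double(n) / sampleRate                         // time of this sample
    return Float(sin(2.0 * Double.pi * frequency * t))     // instantaneous amplitude
}

print("Captured \(samples.count) samples")                 // 480 samples for 10 ms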

Sampling rate

Requirements for the sampling rate are the easier to determine, using the Nyquist-Shannon Theorem.

Harry Nyquist stated in 1928, and Claude Shannon proved in 1949, that in order to reconstruct a waveform accurately you must sample the analogue waveform at a rate of at least twice the highest frequency you need to reconstruct. So for normal human hearing, the lowest sampling rate which we can use is 2 x 20,000 = 40,000 Hz.
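To see what goes wrong when a frequency exceeds half the sampling rate, here’s a small illustrative Swift sketch, with a 30 kHz tone chosen only as an example: sampled at 48 kHz, it produces exactly the same sequence of values as an 18 kHz tone (inverted), an effect known as aliasing.

import Foundation

// Sketch of what happens to a frequency above half the sampling rate.
// At 48 kHz, anything above 24 kHz can't be represented: a 30 kHz tone
// produces the same sample values as an 18 kHz tone (inverted), so the
// two are indistinguishable once digitised. This is known as aliasing.
let sampleRate = 48_000.0
let ultrasonic = 30_000.0                   // above the 24 kHz limit
let folded = sampleRate - ultrasonic        // 18 kHz, back in the audible range

for n in 0..<5 {
    let t = Double(n) / sampleRate
    let a = sin(2.0 * Double.pi * ultrasonic * t)
    let b = sin(2.0 * Double.pi * folded * t)
    print(String(format: "sample %d: %+.4f  %+.4f", n, a, b))
}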

In practice, a little headroom is allowed, and the sampling frequency used for Audio CD was set at 44,100 Hz. More recently that has been replaced in most good quality digital audio systems by 48,000 Hz. But most sound and audio engineers prefer to work at twice that frequency, 96 kHz, before downsampling to 48 kHz for output.

It follows from the Nyquist-Shannon Theorem that digital audio sampled at 48 kHz is only capable of faithfully reproducing audio frequencies of up to 24 kHz. You couldn’t use that to record the ultrasonic sounds made by bats, or even many dog whistles. But is it sufficient to preserve the overtones and harmonics responsible for subtle properties such as timbre?

To answer that, you need to understand how we hear sound.

Human ear

Our auditory transducers, the biological equivalent of ADCs, are hair cells contained in the Organ of Corti, in the cochlea, inside the inner ear. Depending on their position within the Organ of Corti, and the properties of the basilar membrane there, hair cells are highly sensitive to specific frequencies of sound. We know from abundant testing that few humans have hair cells capable of detecting sound at frequencies higher than 20 kHz.

Spectrum analysis

To see what frequencies might be involved in timbre, we need to resort to another eponymous analysis, that of Fourier. Using this, we can decompose the complex waveform generated by a musical instrument into a frequency spectrum, showing the frequency content of the sound.

Waveform from sample audio file (lossless compression).

This sample contains a wide range of frequencies, which make up a complex series of different sounds.

Frequency spectrum of sample audio file (FLAC lossless compression).

Its frequency spectrum shows almost no power in frequencies above 16 kHz, and none at all above 22 kHz. Almost all the frequencies are in the audible range for humans, and according to the Nyquist-Shannon Theorem they could be faithfully represented by sampling at a frequency of at least 44 kHz. Using a higher sampling frequency of 48 kHz won’t preserve the original signal any better. Equally, humans are unable to distinguish between sound output from this at 44, 48 or 96 kHz.
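For those who’d like to see the principle in code, here’s a naive Swift sketch of a discrete Fourier transform, not the optimised FFT routines in Apple’s Accelerate framework that real audio software uses; the 1 kHz test tone and 48 kHz rate are just example values.

import Foundation

// Naive discrete Fourier transform, for illustration only: real audio
// software uses optimised FFT routines, such as those in Apple's
// Accelerate framework. Returns the magnitude of each frequency bin
// from 0 Hz up to half the sampling rate.
func magnitudeSpectrum(of samples: [Double], sampleRate: Double) -> [(frequency: Double, magnitude: Double)] {
    let n = samples.count
    return (0...(n / 2)).map { k in
        var re = 0.0, im = 0.0
        for (i, x) in samples.enumerated() {
            let phase = 2.0 * Double.pi * Double(k) * Double(i) / Double(n)
            re += x * cos(phase)
            im -= x * sin(phase)
        }
        let freq = Double(k) * sampleRate / Double(n)
        return (freq, sqrt(re * re + im * im) / Double(n))
    }
}

// A pure 1 kHz tone sampled at 48 kHz should show a single peak at 1 kHz.
let rate = 48_000.0
let tone = (0..<480).map { sin(2.0 * Double.pi * 1_000.0 * Double($0) / rate) }
let peak = magnitudeSpectrum(of: tone, sampleRate: rate).max { $0.magnitude < $1.magnitude }!
print("Strongest bin: \(peak.frequency) Hz")    // 1000.0 Hz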

Sampled values

In the early days of digital audio, sampled values obtained from continuous waveforms were stored as integers, and higher quality conversion used larger integer types. Single-byte integers have only 256 different values, so the errors introduced when using them to represent a continuous waveform are large. Early schemes simply used larger integer types to reduce those discretisation errors.
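Here’s a rough Swift illustration of those discretisation errors, quantising an arbitrary example value to 8-bit and 16-bit integers and comparing each with the original:

import Foundation

// Sketch of discretisation error: the same amplitude stored as an 8-bit
// and as a 16-bit signed integer, then converted back for comparison.
let original = 0.337251             // an arbitrary amplitude between -1 and 1

let as8bit = Double(Int8(original * 127.0)) / 127.0
let as16bit = Double(Int16(original * 32_767.0)) / 32_767.0

print(String(format: "8-bit error:  %.6f", abs(original - as8bit)))     // up to about 0.008
print(String(format: "16-bit error: %.8f", abs(original - as16bit)))    // up to about 0.00003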

More recently, with the advent of more capable processors, digital sound data has changed to using floating-point types, and the standard used internally in the Mac’s Core Audio currently employs 32-bit floating-point numbers for precise representation of digitised waveforms.
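As a small illustration, AVFoundation’s AVAudioFormat can describe such a 48 kHz, 32-bit float stream, and converting 16-bit integer samples into the -1 to 1 floating-point range is a simple division; this is only a sketch, not how Core Audio is implemented internally.

import AVFoundation

// A 48 kHz, stereo, 32-bit floating-point PCM format, the kind of
// description used for audio streams on macOS.
let format = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                           sampleRate: 48_000,
                           channels: 2,
                           interleaved: false)!

// Converting 16-bit integer samples into the -1...1 floating-point range.
let intSamples: [Int16] = [0, 16_384, -32_768, 32_767]
let floatSamples = intSamples.map { Float($0) / Float(Int16.max) }

print(floatSamples)     // [0.0, 0.50001526, -1.0000305, 1.0]
print(format)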

The combination of a sampling frequency of 48 kHz and 32-bit floating-point values ensures high fidelity in audio input and output, ample for human hearing to be unable to detect any difference.

Oversampling

While 48 kHz sampling is more than adequate to preserve all audio frequencies which can be heard by humans, wouldn’t it be preferable to sample at 96 kHz or more, to get even better quality?

Returning to the Nyquist-Shannon Theorem, sampling at 96 kHz rather than 48 kHz would add audio frequencies of 24-48 kHz, which can be heard by domestic pets and many other animals. Unfortunately, unless you buy very special audio equipment, you’ll discover that microphones and sound input devices, as well as headphones and speakers, simply can’t handle frequencies above 20 kHz, and in many cases their performance is already dropping off above about 17 kHz.

The penalty of oversampling is that, whatever you do with your digital audio data, there’s twice as much of it at 96 kHz as there is at 48 kHz. That may not seem much, but many computer algorithms don’t run in time directly proportional to the quantity of data. Among those are some of the most important in audio processing, such as Fourier transforms. While fast Fourier transforms (FFTs) scale roughly as n log n rather than in proportion to the square of the data size, doubling the quantity of data still imposes significantly greater demands on your Mac’s processor. There ain’t no such thing as a free lunch when it comes to digital audio processing.
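For a rough sense of that scaling, here’s a back-of-the-envelope Swift sketch comparing the n log n work for one second of audio at the two rates; it’s not a benchmark of any real FFT implementation.

import Foundation

// Back-of-the-envelope comparison of FFT work, proportional to n log2 n,
// for one second of single-channel audio at 48 kHz and at 96 kHz.
func fftWork(_ n: Double) -> Double { n * log2(n) }

let ratio = fftWork(96_000) / fftWork(48_000)
print(String(format: "96 kHz needs %.2f times the work of 48 kHz", ratio))     // about 2.13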

Nevertheless, most sound and audio engineers still opt for higher sampling rates than are necessary to generate perfectly good sound output for human ears. That’s a pro choice; for the rest of us there’s nothing to gain for the additional cost.

Reference

Apple’s extensive documentation starts here.