Over the last couple of weeks, I’ve read more benchmarks and other performance measurements on SSDs than I’ve ever seen before, thanks to so many of you who have contributed results from your own tests. This article explains some of the difficulties in interpreting this avalanche of data, and how we can move forward.
Benchmark for what?
We run benchmark tests for different reasons. For some, it’s to prove that their purchase choice performs better than those of others. For those in the trade, it’s to show how fast their product is compared with their competitors’. What I want to know, though, is how fast storage will be when in use, typically doing mundane tasks like reading and saving files, and when copying in the Finder.
That’s important, because some benchmarks use quite different code from that normally used by apps, and features in storage can also be tuned to deliver better benchmark results even though in real use they’re slower. So any benchmark that runs crafted code calling low-level functions in C doesn’t tell me as much as one using standard FileHandle calls from Swift or Objective-C. And if a test doesn’t explain exactly what it does, we simply can’t trust what it’s doing.
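To make that concrete, here’s a minimal sketch of the kind of measurement I mean: timing a plain sequential write through ordinary buffered file I/O, the same call path an app would use. It’s written in Python for brevity rather than Swift’s FileHandle, and all names and parameters are illustrative, not taken from any benchmark tool. The `fsync` matters: without it, you may be timing a copy into memory rather than a transfer to the device.

```python
import os
import time

def measure_write(path, size_mb=1024, chunk_mb=1):
    """Time a sequential write using ordinary buffered I/O,
    returning an estimated rate in GB/s. Illustrative only."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force the data to the device, not just RAM
    elapsed = time.perf_counter() - start
    return (size_mb / 1024) / elapsed
```

Even a simple sketch like this documents exactly what it does, which is more than can be said for some benchmark apps.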
Having decided what to benchmark, we then need to get its best estimate, either with small dispersion or known variance. In principle, all these tests should be deterministic, so with negligible noise or error, but in practice there are a great many other factors which can come into play, including:
- combinations of hardware, including case/enclosure, cable, host Mac
- different versions of macOS, kexts, firmware, etc.
- other software running which may access the disk during testing
- variation in negotiated bus/port speeds
- capacity of storage
- amount of free space on the storage
- availability of SLC Write cache on SSDs
- caching/buffering in memory, both on the Mac and in the storage
- onset of write cache exhaustion or thermal throttling
- unknown factors.
As in many other fields, you can design small experiments which control as many of those variables as possible in order to make comparisons, or you can pool data with high dispersion and analyse it statistically. What you can’t do is use individual data taken from the population to make the sort of comparisons you’d make in controlled experiments, because too many of those variables are uncontrolled.
This is a trap we all fall into at times. Suppose we see 99 reports of a particular SSD clocking up a write speed of 2.8 GB/s, and a single claim of 3.4 GB/s. Does that mean that those 99 reports were in error? Statistically, that singular result is an outlier. It might be an outright lie (some people do say strange things online), a simple mistake for what should have been 2.4 GB/s, or an unreliable test because of any of the factors above and below.
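One way to formalise that intuition is a simple screen for results far from the bulk of reports. This hypothetical example flags any rate more than a few standard deviations from the mean; it’s a crude z-score test, not something any published benchmark does, but it shows how the 3.4 GB/s claim would stand out against the 99 others.

```python
import statistics

def flag_outliers(rates, k=3.0):
    """Return reported rates lying more than k standard deviations
    from the mean of all reports. A crude illustrative screen."""
    mean = statistics.fmean(rates)
    sd = statistics.stdev(rates)
    if sd == 0:
        return []
    return [x for x in rates if abs(x - mean) > k * sd]

reports = [2.8] * 99 + [3.4]
suspect = flag_outliers(reports)  # flags only the 3.4 GB/s claim
```

Flagging an outlier doesn’t tell us why it’s wrong, of course; it only tells us which result deserves scrutiny.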
By far the most popular benchmarks used in macOS are those generously provided free by Blackmagic and Amorphous.
Blackmagic Disk Speed Test is primarily intended to help users of Blackmagic products determine whether their storage is capable of the write and read performance required to handle different types of video. Its analogue speedometer display is fun, but constantly changes during testing, leaving you guessing what the true transfer rates were. This makes its results subjective and susceptible to user interpretation.
AmorphousDiskMark is backed up by more information: for instance, its default sequential read/write queue depth is 8, it runs 5 test iterations with a test size of 1 GiB, a test interval of 5 s, and a test duration limit of 5 s. For the most widely quoted figure of “SEQ1M QD8”, it’s described as “reading/writing the specified size file sequentially with 128 KiB blocks from the specified number of threads (queue depth).”
It’s most commonly used in its default configuration, with 5 test iterations of 1 GiB, for which it “shows the median score”. Although taking the median of five results may appear a wise statistical precaution, as the user is given no idea of the spread of results it’s easy to see how misleading that can be. For example, two sequences of test results would result in the same median:
- 2.5, 2.5, 1.5, 0.5, 0.5 GB/s
- 1.5, 1.6, 1.5, 1.4, 1.5 GB/s
In the first set, the median of 1.5 GB/s appears to be part of a sequence in transition from a high value of 2.5 to a low value of 0.5 GB/s, while in the second set, it appears to be a good estimate of tests with low dispersion. Without the original values or a measure of dispersion, the user isn’t aware of the nature of the results. That approach is sadly extremely common among benchmarks of all kinds.
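The two sequences above can be checked in a few lines. Both have the same median, but even the simplest measure of dispersion, the range, immediately distinguishes them; a benchmark reporting only the median discards exactly this information.

```python
import statistics

runs_a = [2.5, 2.5, 1.5, 0.5, 0.5]  # apparently in transition from high to low
runs_b = [1.5, 1.6, 1.5, 1.4, 1.5]  # stable, with low dispersion

for runs in (runs_a, runs_b):
    med = statistics.median(runs)
    spread = max(runs) - min(runs)
    # both medians are 1.5 GB/s, but the ranges are 2.0 vs 0.2 GB/s
    print(f"median {med:.1f} GB/s, range {spread:.1f} GB/s")
```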
A problem common to both these tests is their reliance on a single transfer size, typically in the range 1-5 GB, and the limited number of measurements performed. Transfer rates for sizes in the range 2-100 MB are often significantly lower, and more relevant to typical use, and some storage becomes significantly slower above 1 GB, equating to large media file sizes. Measuring five transfers of around 1 GB each is inadequate to provide any information about such broader performance.
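Sweeping across transfer sizes instead of fixing one is straightforward. This sketch (again in Python, with illustrative sizes; a real test would extend to 1 GB and beyond, and repeat each size several times) measures the sequential write rate at a range of sizes, which is where the size-dependent behaviour described above would show up.

```python
import os
import tempfile
import time

def write_rate(path, size_mb):
    """Write size_mb of data sequentially and return the rate in MB/s."""
    chunk = os.urandom(1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        os.fsync(f.fileno())  # ensure we time the device, not a RAM buffer
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

# Sweep from small files up towards large media file sizes.
path = os.path.join(tempfile.mkdtemp(), "sweep.bin")
rates = {size: write_rate(path, size) for size in (2, 20, 100)}
```

Plotting `rates` against size, rather than quoting one number, reveals whether the storage slows at either end of the range.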
Limits of credibility
One of the simplest ways to start validating results is to know the limits of credibility. If someone claims to be 3 metres tall, we know that either the number is wrong, or they’re nearly 30 centimetres taller than the tallest person ever. For each external bus, there’s a physical limit to the transfer rate that can be achieved between Mac and peripheral, although even that can be controversial.
For example, Thunderbolt 3 and 4 are limited to an absolute maximum of four lanes of PCIe 3.0, which works out at slightly less than 4 GB/s. However, compliance with those standards for peripherals doesn’t require support for all four lanes, only two. This means that most TB3 external SSDs will be limited to less than 2 GB/s.
Vendors of TB3 SSD enclosures and complete drives, such as specialists OWC, are careful not to claim transfer rates over 2.8 GB/s for their TB3 products, and those supporting two rather than four PCIe lanes are generally claimed to deliver slightly less than 1.6 GB/s. Even if a Samsung 980 PRO SSD might be capable of a read speed of 7 GB/s when connected directly to an internal interface, that doesn’t mean that it can be any faster than those system limits for TB3.
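Those limits make a simple credibility check possible. The ceilings below are rough figures for illustration (PCIe 3.0 carries roughly 985 MB/s per lane, so two lanes give just under 2 GB/s and four just under 4 GB/s); any claimed rate above the relevant ceiling can’t be a true device-to-host transfer.

```python
# Approximate physical ceilings for common external buses, in GB/s.
# Figures are rough illustrations, not vendor specifications.
BUS_LIMITS = {
    "USB 3.1 Gen 2 (10 Gb/s)": 1.0,
    "TB3, 2 PCIe lanes": 1.97,
    "TB3/TB4, 4 PCIe lanes": 3.94,
}

def credible(rate_gb_s, bus):
    """Return False for a claimed rate exceeding the bus's physical limit."""
    return rate_gb_s <= BUS_LIMITS[bus]
```

So a 7 GB/s read claim for a 980 PRO over TB3 fails the check, while OWC’s 2.8 GB/s figure for a four-lane product passes.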
Several of us are also painfully aware of the most common cause of anomalously high transfer rates: buffering and caching. Particularly when dealing with files smaller than 2 MB, macOS and storage devices go out of their way to improve performance using fast memory. This is easy to spot when you have access to individual test results or estimates of dispersion or variance: transfer rates exceed what’s physically possible, at over 10 or even 20 GB/s. Without statistical countermeasures far more robust than averaging or taking a median, these can readily poison most benchmarks.
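One such countermeasure is simply to discard runs that exceed the physical bus ceiling before summarising, since those must have been served from memory rather than the device. This is a hypothetical sketch, not the method of any existing benchmark, and the 3.94 GB/s limit used in the example is the rough four-lane TB3 ceiling discussed above.

```python
import statistics

def screened_median(rates, bus_limit_gb_s):
    """Drop runs above the physical bus ceiling (likely served from a
    RAM cache rather than the device), then report the median and range
    of the survivors. Illustrative only."""
    real = [r for r in rates if r <= bus_limit_gb_s]
    if not real:
        return None, None
    return statistics.median(real), max(real) - min(real)

# One run at 19.5 GB/s is physically impossible over TB3 and is discarded.
med, spread = screened_median([2.7, 2.8, 19.5, 2.8, 2.75], 3.94)
```

Reporting the spread alongside the median also exposes cases where most of the runs, not just one, were inflated by caching.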
SLC Write cache phenomena are an interesting issue here. They don’t involve a separate cache or buffer, but use part of the main SSD storage as accelerated but space-inefficient temporary storage. As they can’t normally be disabled, the principle of obtaining real-world measurements should apply here too. It’s useful to quantify the size of this cache on essentially empty storage, but in general there’s little benefit to deliberately exhausting write cache before measuring transfer rates.
Where we go from here
While it’s always interesting to hear of the claimed performance of external storage, particularly SSDs, amassing large quantities of uncontrolled noisy data isn’t likely to bring much progress. Whenever possible, we should use small, controlled experiments to reduce the unknowns and hold constant those factors we know affect results. Changing one factor at a time goes a long way towards making meaningful conclusions possible.
Issues which will continue to be important to examine include:
- USB 3.x compliance and performance of TB4 ports.
- PCIe lane use by external storage devices; which if any enclosures can use all four lanes?
- Mac TB4 bus to port relationships; which ports are fed by which TB bus?
- Mac TB4 port interactions; which if any combinations perform poorly?
- Model- and firmware-specific performance problems.
These apply particularly but not exclusively to Apple Silicon models.
Where anecdotal observations can prove very useful is in arousing suspicion of problems or of unusually high performance, even though they may contribute little to confirming or explaining them.