Over the last couple of weeks, I’ve read more benchmarks and other performance measurements on SSDs than I’ve ever seen before, thanks to so many of you who have contributed results from your own tests. This article explains some of the difficulties in interpreting this avalanche of data, and how we can move forward.
Benchmark for what?
We run benchmark tests for different reasons. For some, it’s to prove that their purchase choice performs better than those of others. For those in the trade, it’s to show how fast their product is compared with their competitors’. What I want to know, though, is how fast storage will be when in use, typically doing mundane tasks like reading and saving files, and when copying in the Finder.
That’s important, because some benchmarks use quite different code from that normally used by apps, and features in storage can also be tuned to deliver better benchmark results even though in real use they’re slower. So any benchmark that runs crafted code calling low-level functions in C doesn’t tell me as much as one using standard FileHandle calls from Swift or Objective-C. And if a test doesn’t explain exactly what it does, we simply can’t trust what it’s doing.
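To make that concrete, here’s a minimal sketch of the kind of measurement I mean: timing a plain sequential write through ordinary buffered file I/O, the same call path an app would use. It’s written in Python for brevity rather than Swift’s FileHandle, and all names and parameters are illustrative, not taken from any benchmark tool. The `fsync` matters: without it, you may be timing a copy into memory rather than a transfer to the device.

```python
import os
import time

def measure_write(path, size_mb=1024, chunk_mb=1):
    """Time a sequential write using ordinary buffered I/O,
    returning an estimated rate in GB/s. Illustrative only."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force the data to the device, not just RAM
    elapsed = time.perf_counter() - start
    return (size_mb / 1024) / elapsed
```

Even a simple sketch like this documents exactly what it does, which is more than can be said for some benchmark apps.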
Having decided what to benchmark, we then need to get its best estimate, either with small dispersion or known variance. In principle, all these tests should be deterministic, so with negligible noise or error, but in practice there are a great many other factors which can come into play, including:
- combinations of hardware, including case/enclosure, cable, host Mac
- different versions of macOS, kexts, firmware, etc.
- other software running which may access the disk during testing
- variation in negotiated bus/port speeds
- capacity of storage
- amount of free space on the storage
- availability of SLC Write cache on SSDs
- caching/buffering in memory, both on the Mac and in the storage
- onset of write cache exhaustion or thermal throttling
- unknown factors.
As in many other fields, you can design small experiments which control as many of those variables as possible in order to make comparisons, or you can pool data with high dispersion and analyse it statistically. What you can’t do is use individual data taken from the population to make the sort of comparisons you’d make in controlled experiments, because too many of those variables are uncontrolled.
This is a trap we all fall into at times. Suppose we see 99 reports of a particular SSD clocking up a write speed of 2.8 GB/s, and a single claim of 3.4 GB/s. Does that mean that those 99 reports were in error? Statistically, that singular result is an outlier. It might be an outright lie (some people do say strange things online), a simple mistake for what should have been 2.4 GB/s, or an unreliable test because of any of the factors above and below.
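One way to formalise that intuition is a simple screen for results far from the bulk of reports. This hypothetical example flags any rate more than a few standard deviations from the mean; it’s a crude z-score test, not something any published benchmark does, but it shows how the 3.4 GB/s claim would stand out against the 99 others.

```python
import statistics

def flag_outliers(rates, k=3.0):
    """Return reported rates lying more than k standard deviations
    from the mean of all reports. A crude illustrative screen."""
    mean = statistics.fmean(rates)
    sd = statistics.stdev(rates)
    if sd == 0:
        return []
    return [x for x in rates if abs(x - mean) > k * sd]

reports = [2.8] * 99 + [3.4]
suspect = flag_outliers(reports)  # flags only the 3.4 GB/s claim
```

Flagging an outlier doesn’t tell us why it’s wrong, of course; it only tells us which result deserves scrutiny.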
By far the most popular benchmarks used in macOS are those generously provided free by Blackmagic and Amorphous.
Blackmagic Disk Speed Test is primarily intended to help users of Blackmagic products determine whether their storage is capable of the write and read performance required to handle different types of video. Its analogue speedometer display is fun, but constantly changes during testing, leaving you guessing what the true transfer rates were. This makes its results subjective and susceptible to user interpretation.
AmorphousDiskMark is backed up by more information: for instance, its default sequential read/write queue depth is 8, it runs 5 test iterations with a test size of 1 GiB, a test interval of 5 s, and a test duration limit of 5 s. For the most widely quoted figure of “SEQ1M QD8”, it’s described as “reading/writing the specified size file sequentially with 128 KiB blocks from the specified number of threads (queue depth).”
It’s most commonly used in its default configuration, with 5 test iterations of 1 GiB, for which it “shows the median score”. Although taking the median of five results may appear a wise statistical precaution, as the user is given no idea of the spread of results it’s easy to see how misleading that can be. For example, two sequences of test results would result in the same median:
- 2.5, 2.5, 1.5, 0.5, 0.5 GB/s
- 1.5, 1.6, 1.5, 1.4, 1.5 GB/s
In the first set, the median of 1.5 GB/s appears to be part of a sequence in transition from a high value of 2.5 to a low value of 0.5 GB/s, while in the second set, it appears to be a good estimate of tests with low dispersion. Without the original values or a measure of dispersion, the user isn’t aware of the nature of the results. That approach is sadly extremely common among benchmarks of all kinds.
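The two sequences above can be checked in a few lines. Both have the same median, but even the simplest measure of dispersion, the range, immediately distinguishes them; a benchmark reporting only the median discards exactly this information.

```python
import statistics

runs_a = [2.5, 2.5, 1.5, 0.5, 0.5]  # apparently in transition from high to low
runs_b = [1.5, 1.6, 1.5, 1.4, 1.5]  # stable, with low dispersion

for runs in (runs_a, runs_b):
    med = statistics.median(runs)
    spread = max(runs) - min(runs)
    # both medians are 1.5 GB/s, but the ranges are 2.0 vs 0.2 GB/s
    print(f"median {med:.1f} GB/s, range {spread:.1f} GB/s")
```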
A problem common to both these tests is their reliance on a single transfer size, typically in the range 1-5 GB, and the limited number of measurements performed. Transfer rates for sizes in the range 2-100 MB are often significantly lower, and more relevant to typical use, and some storage becomes significantly slower above 1 GB, equating to large media file sizes. Measuring five transfers of around 1 GB each is inadequate to provide any information about such broader performance.
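Sweeping across transfer sizes instead of fixing one is straightforward. This sketch (again in Python, with illustrative sizes; a real test would extend to 1 GB and beyond, and repeat each size several times) measures the sequential write rate at a range of sizes, which is where the size-dependent behaviour described above would show up.

```python
import os
import tempfile
import time

def write_rate(path, size_mb):
    """Write size_mb of data sequentially and return the rate in MB/s."""
    chunk = os.urandom(1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        os.fsync(f.fileno())  # ensure we time the device, not a RAM buffer
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

# Sweep from small files up towards large media file sizes.
path = os.path.join(tempfile.mkdtemp(), "sweep.bin")
rates = {size: write_rate(path, size) for size in (2, 20, 100)}
```

Plotting `rates` against size, rather than quoting one number, reveals whether the storage slows at either end of the range.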
Limits of credibility
One of the simplest ways to start validating results is to know the limits of credibility. If someone claims to be 3 metres tall, we know that either the number is wrong, or they’re nearly 30 centimetres taller than the tallest person ever. For each external bus, there’s a physical limit to the transfer rate that can be achieved between Mac and peripheral, although even that can be controversial.
For example, Thunderbolt 3 and 4 are limited to an absolute maximum of four lanes of PCIe 3.0, which works out at slightly less than 4 GB/s. However, compliance with those standards for peripherals doesn’t require support for all four lanes, only two. This means that most TB3 external SSDs will be limited to less than 2 GB/s.
Vendors of TB3 SSD enclosures and complete drives, such as specialists OWC, are careful not to claim transfer rates over 2.8 GB/s for their TB3 products, and those supporting two rather than four PCIe lanes are generally claimed to deliver slightly less than 1.6 GB/s. Even if a Samsung 980 PRO SSD might be capable of a read speed of 7 GB/s when connected directly to an internal interface, that doesn’t mean that it can be any faster than those system limits for TB3.
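Those limits make a simple credibility check possible. The ceilings below are rough figures for illustration (PCIe 3.0 carries roughly 985 MB/s per lane, so two lanes give just under 2 GB/s and four just under 4 GB/s); any claimed rate above the relevant ceiling can’t be a true device-to-host transfer.

```python
# Approximate physical ceilings for common external buses, in GB/s.
# Figures are rough illustrations, not vendor specifications.
BUS_LIMITS = {
    "USB 3.1 Gen 2 (10 Gb/s)": 1.0,
    "TB3, 2 PCIe lanes": 1.97,
    "TB3/TB4, 4 PCIe lanes": 3.94,
}

def credible(rate_gb_s, bus):
    """Return False for a claimed rate exceeding the bus's physical limit."""
    return rate_gb_s <= BUS_LIMITS[bus]
```

So a 7 GB/s read claim for a 980 PRO over TB3 fails the check, while OWC’s 2.8 GB/s figure for a four-lane product passes.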
Several of us are also painfully aware of the most common cause of anomalously high transfer rates: buffering and caching. Particularly when dealing with files smaller than 2 MB, macOS and storage devices go out of their way to improve performance using fast memory. This is easy to spot when you have access to individual test results or estimates of dispersion or variance: transfer rates exceed what’s physically possible, at over 10 or even 20 GB/s. Without statistical countermeasures far more robust than averaging or taking a median, these can readily poison most benchmarks.
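One such countermeasure is simply to discard runs that exceed the physical bus ceiling before summarising, since those must have been served from memory rather than the device. This is a hypothetical sketch, not the method of any existing benchmark, and the 3.94 GB/s limit used in the example is the rough four-lane TB3 ceiling discussed above.

```python
import statistics

def screened_median(rates, bus_limit_gb_s):
    """Drop runs above the physical bus ceiling (likely served from a
    RAM cache rather than the device), then report the median and range
    of the survivors. Illustrative only."""
    real = [r for r in rates if r <= bus_limit_gb_s]
    if not real:
        return None, None
    return statistics.median(real), max(real) - min(real)

# One run at 19.5 GB/s is physically impossible over TB3 and is discarded.
med, spread = screened_median([2.7, 2.8, 19.5, 2.8, 2.75], 3.94)
```

Reporting the spread alongside the median also exposes cases where most of the runs, not just one, were inflated by caching.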
SLC Write cache phenomena are an interesting issue here. They don’t involve a separate cache or buffer, but use part of the main SSD storage as accelerated but space-inefficient temporary storage. As they can’t normally be disabled, the principle of obtaining real-world measurements should apply here too. It’s useful to quantify the size of this cache on essentially empty storage, but in general there’s little benefit to deliberately exhausting write cache before measuring transfer rates.
Where we go from here
While it’s always interesting to hear of the claimed performance of external storage, particularly SSDs, amassing large quantities of uncontrolled noisy data isn’t likely to bring much progress. Whenever possible, we should use small, controlled experiments to reduce the unknowns and hold constant those factors we know affect results. Changing one factor at a time goes a long way towards making meaningful conclusions possible.
Issues which will continue to be important to examine include:
- USB 3.x compliance and performance of TB4 ports.
- PCIe lane use by external storage devices; which if any enclosures can use all four lanes?
- Mac TB4 bus to port relationships; which ports are fed by which TB bus?
- Mac TB4 port interactions; which if any combinations perform poorly?
- Model- and firmware-specific performance problems.
These apply particularly but not exclusively to Apple Silicon models.
Where anecdotal observations can prove very useful is in arousing suspicion of problems or of unusually high performance, even though they may contribute little to confirming or explaining them.