Explainer: S.M.A.R.T. and disk health

It was another of those great ideas: build a monitoring system into the firmware of storage devices which could report signs of impending failure. First introduced in the ATA standard back in 1995 as Self-Monitoring, Analysis and Reporting Technology or S.M.A.R.T. (most accurately including the stops/periods), it has been widely adopted across many types of storage device. But it’s far from standardised, and in many cases not even accessible.

The underlying theory is that the majority of disk/drive failures should occur as a result of wear and tear. If you then monitor aspects of the device which deteriorate as a result of the same wear and tear, then as those variables start looking ominous, you’ll be able to anticipate more significant failure. This is a bit like monitoring your body temperature when you think you might have an infection coming, in the expectation that a rise in temperature will precede the infection becoming more manifest, and will allow you to turn the alarm off and go back to sleep instead of going to work.

Two big snags with S.M.A.R.T. are its lack of standardised indicators or attributes, and lack of support for reading them in macOS.

A glance at the Wikipedia article on S.M.A.R.T. shows how allowing each storage manufacturer to define their own set of attributes and their values has led to many attributes and confusion over how to interpret them. Many are still aimed at hard disks, although S.M.A.R.T. is widely used in SSDs too. Clearly, to make any sense of these, an effective monitoring app needs a lot of additional manufacturer-specific information.

Although macOS has supported storage connected via USB-A ports since 1998 and USB-C since 2015, it has never supported access to S.M.A.R.T. attributes on storage connected using those ports. FireWire support was rather better, and Thunderbolt should give S.M.A.R.T. access by default. Whether a given drive is supported can be confirmed in Disk Utility, or in the Storage item in System Information, where the last entry for each supported physical drive gives its current S.M.A.R.T. status.

Apple’s simplistic entry for S.M.A.R.T. status doesn’t even report when this was last checked, but several excellent third-party utilities give more detailed access. My favourite substitute remains DriveDx, but there are others, as well as the free command tools in smartmontools, which let you roll your own with modest effort.
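As a rough idea of what rolling your own might involve, here is a minimal Python sketch that asks smartctl (from smartmontools) for a JSON report and summarises it. It assumes smartmontools 7.0 or later, which added the `--json` option, and that the report uses the `smart_status` and `ata_smart_attributes` keys as I understand them for ATA devices; NVMe drives report their health differently, and the function names here are my own.

```python
import json
import subprocess

def read_smart_json(device: str) -> dict:
    """Run smartctl on a device and return its JSON report as a dict."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # doesn't always mean there's no usable output: don't use check=True.
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def summarise(report: dict) -> dict:
    """Pick out overall health and any ATA attributes at or below threshold."""
    # Assumed key names; 'passed' is None when no health status is reported.
    passed = report.get("smart_status", {}).get("passed")
    table = report.get("ata_smart_attributes", {}).get("table", [])
    failing = [
        attr["name"] for attr in table
        # A threshold of 0 means the attribute is informational only.
        if attr.get("thresh", 0) > 0 and attr["value"] <= attr["thresh"]
    ]
    return {"passed": passed, "failing": failing}
```

Calling `summarise(read_smart_json("/dev/disk0"))` would then give the overall status and a list of failing attributes, where `/dev/disk0` is only an example device and smartctl may need to be run with elevated privileges. This only works for storage whose S.M.A.R.T. data is accessible in the first place.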

However, for any third-party utility to be able to monitor storage connected by USB, the SAT SMART kernel extension has to be installed. Although this does apparently run on M1 series Macs as well as Intel models, an M1 Mac can’t run at Full Security if it has to load third-party kernel extensions. Users are therefore forced to make a choice: do they downgrade security and accept the risks of a third-party kernel extension so they can monitor storage connected via USB, or do they maintain Full Security and lose S.M.A.R.T. access?

Yet it’s the same Apple that wants us all to run Macs at Full Security and to eliminate third-party kernel extensions, but can’t find its way to building support for S.M.A.R.T. over USB into macOS.

Real-world performance of S.M.A.R.T. also isn’t always as impressive as we might wish.

Storage failures don’t only occur as a result of wear and tear. A graph of failure rates over time shows a ‘U’ shape, with early failures which aren’t associated with wear and tear, and are likely to appear out of the blue. Once that early high rate falls, very few drives fail until they approach the end of their reliable working life. That working life varies considerably between batches, but for good manufacturers is usually safely beyond the warranty period. It’s then that S.M.A.R.T. should prove most useful, if you don’t replace your storage once its warranty runs out.
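One common way to model that ‘U’ shape is to add together a failure rate which falls with age, a constant background rate, and a rate which rises with wear. The Python sketch below does that using Weibull hazard functions. Every parameter is made up purely to produce the shape, and isn’t drawn from any real drive data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at age t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t: float) -> float:
    """Illustrative 'U'-shaped failure rate, with t in years."""
    # All parameters below are invented for illustration only.
    early = weibull_hazard(t, shape=0.5, scale=20.0)  # shape < 1: falls with age
    background = 0.01                                 # constant random failures
    wear = weibull_hazard(t, shape=5.0, scale=8.0)    # shape > 1: rises with age
    return early + background + wear

for years in (0.1, 1, 3, 5, 7, 9):
    print(f"{years:>4} years: {bathtub_hazard(years):.3f}")
```

With these invented parameters the rate starts high, dips to its lowest after a few years, then climbs steeply. S.M.A.R.T. is aimed at that final climb, and has little to say about the early failures on the left of the curve.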

Studies of storage farms such as those at Google, where drives are worked much harder than in regular computers, show that over a third of disk failures occur without any S.M.A.R.T. warnings at all. When warnings did occur before failure, only some of the variables measured correlated well with that failure, and even they didn’t predict the majority of failures. So it’s all a game of chance, and S.M.A.R.T. may not be smart enough to overcome Sod’s Law.

Even if S.M.A.R.T. status changes to warn of impending failure of storage, it isn’t always clear that your Mac will pass the message on before failure occurs. None of us sits with our eyes glued to S.M.A.R.T. status, so we’re reliant on macOS giving us timely warning of the issue. In the one experience that I have of S.M.A.R.T. status changing to warn of failure, by the time that I had worked out what was going wrong, the disk had crashed completely and broken macOS with it.

The converse also occurs: some changes in S.M.A.R.T. status which are reported as indicators of imminent failure aren’t reliable predictors. Less work has gone into this problem of ‘false positives’, but you could find yourself replacing storage which won’t actually fail for several hundred hours yet.
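These two kinds of error can be put into numbers. The sketch below computes sensitivity (the share of failures which were preceded by a warning) and precision (the share of warnings which were followed by failure) from raw counts. The counts used are hypothetical and not taken from any study; the only figure from above is that, if over a third of failures come without warning, sensitivity must be below about two-thirds.

```python
def warning_metrics(true_pos: int, false_neg: int, false_pos: int) -> dict:
    """Sensitivity and precision of a failure warning, from raw counts."""
    failures = true_pos + false_neg   # drives that actually failed
    warnings = true_pos + false_pos   # drives that raised a warning
    return {
        "sensitivity": true_pos / failures,  # failures that were warned of
        "precision": true_pos / warnings,    # warnings followed by failure
    }

# Hypothetical counts: 60 failures with a warning, 40 without,
# and 30 warnings on drives which went on working.
metrics = warning_metrics(true_pos=60, false_neg=40, false_pos=30)
print(metrics)
```

Low sensitivity means you can’t rely on getting a warning, while low precision means that acting on every warning will replace some storage which still had life left in it.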

Where it’s available, notably in internal storage, S.M.A.R.T. can warn you of impending failure. It isn’t completely reliable, but in most cases a warning that the S.M.A.R.T. status of storage has changed should prompt you to take immediate action to safeguard the data there, and to replace the storage.

However, S.M.A.R.T. status isn’t reliable enough to be an accurate predictor, and has many ‘false negatives’ where storage fails without any warning, and ‘false positives’ where failure is predicted but doesn’t occur for some time. In most cases, it’s better to replace storage as soon as its warranty expires, rather than watching its S.M.A.R.T. status obsessively.