S.M.A.R.T.ypants

S.M.A.R.T. – which generally seems to appear so punctuated – stands for Self-Monitoring, Analysis and Reporting Technology, and is built into the firmware of most modern hard drives to detect imminent failure before it happens.

The theory goes that the majority of drive failures should occur as a result of wear and tear. If you then monitor aspects of the disk which deteriorate as a result of the same wear and tear, then as those variables start looking ominous, you will be able to anticipate more significant failure. This is a bit like monitoring your body temperature when you think you might have flu coming, in the expectation that a rise in temperature will precede the infection becoming more manifest, and will allow you to turn the alarm off and go back to sleep instead of going to work.

S.M.A.R.T. was first introduced in the ATA standard just over ten years ago, and should thus be a feature of all modern SATA (serial) and PATA (parallel) hard drives. The snag is that when drives are housed in external enclosures and connected via FireWire or USB, their S.M.A.R.T. status may not be accessible to the attached computer. So in practice it is only usable with internal drives, and those mounted in more sophisticated systems such as external RAID, NAS, and SAN devices.

The most recent implementations of S.M.A.R.T. include disk self-repair; when the S.M.A.R.T. system detects sector errors, for example, it will try to repair them as it goes along. You may only be aware of this if you use an advanced tool to query the S.M.A.R.T. status of the drive.

The most basic information available is that shown in Disk Utility: whether the S.M.A.R.T. status is OK, or the drive appears to be failing. More advanced tools give you access to the performance history of the drive, and the set thresholds at which it will report that failure is expected.
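The relationship between those thresholds and the simple OK-or-failing answer can be sketched in a few lines of Python. This is only an illustration, not how any particular tool is implemented: the `Attribute` class and both function names are invented here, and it assumes the usual convention that each attribute carries a normalized value which falls as the drive deteriorates, and is treated as failed once it reaches the vendor's threshold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Attribute:
    """One S.M.A.R.T. attribute as a drive might report it (hypothetical type)."""
    attr_id: int
    name: str
    value: int      # current normalized value; higher is healthier
    worst: int      # lowest normalized value recorded so far
    threshold: int  # vendor-set level at which failure is expected


def has_failed(attr: Attribute) -> bool:
    """Treat an attribute as failed once its value is at or below its threshold."""
    return attr.value <= attr.threshold


def overall_status(attrs: list[Attribute]) -> str:
    """Collapse every attribute into the two-state answer a basic tool shows."""
    return "Failing" if any(has_failed(a) for a in attrs) else "OK"
```

Real drives and tools weigh some attributes more heavily than others, so this all-or-nothing rule is a deliberate simplification.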

Limitations

The real-world performance of S.M.A.R.T. is not quite so impressive.

First, drive failures do not only occur as a result of wear and tear. A graph of drive failure rates against time shows a ‘U’ shape, with early failures which are not associated with wear and tear, and are likely to appear out of the blue. Once that early high rate falls, very few drives fail until they approach the end of their reliable working life. That varies considerably between batches of drives, but for good manufacturers is usually safely beyond their warranty period. It is then that S.M.A.R.T. should prove most useful, if you do not replace your drives once their warranty runs out.

Second, studies of drive farms which are worked much harder than drives in regular computers, such as those at Google, show that over a third of drive failures occur without any S.M.A.R.T. warnings at all. When those warnings occurred, prior to drive failure, only some of the variables measured correlated well with drive failure, and even they did not predict the majority of drive failures. So it is all a game of chance, and S.M.A.R.T. may not be smart enough to overcome Sod’s Law.

Third, even if a drive’s S.M.A.R.T. status changes to warn of imminent failure, it is not always clear that your Mac will pass the message on before the drive fails. None of us sits with our eyes glued to S.M.A.R.T. status, so we are reliant on OS X giving us timely warning of the issue. In the one experience that I have of S.M.A.R.T. status changing to warn of failure, by the time that I had worked out what was going wrong, the drive had crashed completely and broken OS X with it.

Fourth, the converse also holds: some changes in S.M.A.R.T. status which are reported as indicators of imminent drive failure are not reliable predictors. Less work has gone into this problem of ‘false positives’, but you could find yourself replacing a drive which will not actually fail for several hundred hours yet.

Advanced Monitoring

SMART Utility normally gives a quick overview of SMART variables for a drive.

You can of course install a utility which monitors S.M.A.R.T. status more formally and alerts you to any change. Products include:

smartmontools, a powerful free command-line tool,
SMART Utility, which will cost you from $25,
SMARTReporter, which provides a friendly GUI wrapper for smartmontools and other utilities, is £3.99 from the App Store,
Micromat’s TechTool Pro and some other general utilities give access too.
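If you go the smartmontools route, its `smartctl -A` command prints the attribute table as plain text, which is easy to pick apart yourself. The sketch below parses text in that tabular layout. The sample rows and their values are made up for illustration, the function name is my own, and the exact layout can differ between drives and smartmontools versions, so treat the regular expression as a starting point rather than a complete parser.

```python
import re

# Sample text in the tabular layout that `smartctl -A` prints for ATA drives.
# These rows and values are invented for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
194 Temperature_Celsius     0x0022   064   052   000    Old_age   Always       -       36 (Min/Max 18/48)
"""

# ID, name, flag, value, worst, threshold, three text columns, then the raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute ID: details} for every table row that matches ROW."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, blank line, or a row in an unexpected layout
        attr_id, name, value, worst, threshold, raw = match.groups()
        attrs[int(attr_id)] = {
            "name": name,
            "value": int(value),
            "worst": int(worst),
            "threshold": int(threshold),
            "raw": int(raw),  # leading integer only; trailing notes are ignored
        }
    return attrs
```

To feed it live data you could capture the output of `smartctl -A` for your own drive, for instance with Python's `subprocess.run`; the device name to pass varies between systems, and the command may need administrator privileges.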

SMARTReporter can give you each of the measured variables from your drive.

Important Status Codes

If you want to view all the measured variables, SMART Utility will satisfy the inner geek.

If you do want to get your head around detailed S.M.A.R.T. status information, here are some variables that are worth watching, given by their S.M.A.R.T. ID number:

5, the count of reallocated sectors. When the drive finds a bad sector, it remaps the data to a spare sector and marks the original as reallocated; when the count of reallocated sectors starts to rise, it is a good indicator that the drive does not have long to live.
9, the total number of hours the drive has been running. Few drives should be expected to live much longer than 40,000 hours, although there are always exceptions.
10, the number of retries needed to spin the drive up to speed. This is most important for laptops, and systems which put their drives to sleep, where a rising value is a good sign of imminent failure.
183, on some drives only, tells you the number of data blocks with uncorrectable errors. When this starts rising, it is a good indicator of imminent failure.
188, the count of operations which were aborted because the drive timed out. If this rises much above zero, failure is coming soon.
196, a count of attempts to transfer data from reallocated sectors. When this rises, failure is much more likely.
197, a count of unstable sectors which are waiting to be reallocated, is similar to 196 in significance.
198, a total count of uncorrectable errors when reading or writing a sector. This is similar to 196 in significance.
201, a count of off-track errors, is another indicator whose rise indicates coming failure.
230, the current state of operation according to the predicted life curve of the drive. This is one variable for which higher values are better, and a fall to a low value is a good indicator to replace the drive as soon as possible.
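If you log these values from time to time, the watch list above boils down to a simple rule: the error counters should not rise, and the hours figure should stay below the rough limit quoted for ID 9. Here is a minimal sketch of that rule, assuming you already have the raw values as a dictionary keyed by ID number; the function name, the labels and the data layout are all my own invention. ID 230 is left out because it works the other way round, with higher values being better.

```python
# Error counters from the watch list; their raw values should stay at or near zero.
ERROR_COUNTERS = {
    5: "reallocated sectors",
    10: "spin-up retries",
    183: "blocks with uncorrectable errors",
    188: "command timeouts",
    196: "reallocation attempts",
    197: "pending reallocations",
    198: "uncorrectable sector errors",
    201: "off-track errors",
}
MAX_HOURS = 40_000  # rough working life quoted above for ID 9


def watch_list_warnings(raw_now, raw_before=None):
    """Compare raw values ({ID: count}) with an earlier reading and list concerns."""
    raw_before = raw_before or {}
    found = []
    for attr_id, label in ERROR_COUNTERS.items():
        now = raw_now.get(attr_id)
        if now is None:
            continue  # not every drive reports every attribute
        before = raw_before.get(attr_id, 0)
        if now > before:
            found.append(f"ID {attr_id}: {label} rose from {before} to {now}")
    hours = raw_now.get(9)
    if hours is not None and hours > MAX_HOURS:
        found.append(f"ID 9: {hours} power-on hours exceeds {MAX_HOURS}")
    return found
```

Calling `watch_list_warnings(today, last_week)` with two such dictionaries returns an empty list when nothing has moved, and one line of text for each counter that has risen.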

S.M.A.R.T. ID 194, or on some drives ID 231, gives the drive temperature, but in most Macs various critical temperatures can be monitored outside the S.M.A.R.T. system, by apps with suitable features.

A detailed list is given in the Wikipedia article on S.M.A.R.T.

Conclusions

Where it is available, notably in internal hard drives, S.M.A.R.T. can warn you of impending drive failure. It is not completely reliable, but in most cases a warning that the S.M.A.R.T. status of a drive has changed to that of imminent failure should prompt you to take immediate action to safeguard the data on that drive, and to replace it.

However, S.M.A.R.T. status is not reliable enough to be an accurate predictor, and has many ‘false negatives’ where a drive fails without any warning, and ‘false positives’ where failure is predicted but does not occur for some time. In most cases, it is better to replace drives as soon as their warranty expires, rather than watching their S.M.A.R.T. status obsessively.
