hoakley April 3, 2020 Macs, Technology

File Integrity 1 : Why bother?

Over coming articles, I’m going to consider a subject which most of us just take for granted, that the contents of our files can be relied upon – file integrity. It’s one of those things we know from bitter experience isn’t completely reliable, but what more can you do than trust your Mac? This article starts by looking at what can go wrong, and what’s there to prevent it.

There are many different reasons that the contents of a file can become changed, including:

we (or a process acting on our behalf) can change it deliberately, by editing the file;
non-malicious software can change it accidentally, for example by writing to the wrong file or storage block;
the data stored can become altered as a result of failure or ‘bit rot’;
malicious software can change the data.

I’m concerned here with the last three and their variations.

There was a time when it wasn’t uncommon for wobbly apps, often just as they were about to crash, to wreak destruction among files stored on disk. At that time, many apps used to write data using low-level commands for speed. Thankfully that’s now unusual, and accidental modification of files by other apps should be a rarity. But it can still happen, even with protections such as sandboxes.

Hard disks are well-known for developing errors and ‘bad blocks’ which can corrupt files, and regardless of some claims this remains true to a lesser extent in SSDs. Worst cases result in complete failure of the storage, and send you to your backups, but minor errors and ‘bit rot’ appear more common. All storage media become unreliable with use and time, although meaningful estimates of error rate are very hard to come by.

If you’ve ever tried accessing old DVD-R or CD-R storage, you’ll have come across examples where files can only be read with errors, or the whole disc is unreadable, even when it has been stored in good conditions in the dark.

One previously common cause of data corruption is failure to complete outstanding disk operations before a forced restart due to a kernel panic or other severe fault. File systems such as HFS+ are particularly prone to this because of the way that they write changes out to disk. Apple introduced journalling to tackle this, and that has been effective in reducing its occurrence but doesn’t eliminate it altogether. APFS was designed using the ‘copy on write’ principle which should make this a problem of the past, although in practice it can still occur very rarely.

The best-known examples of malicious software modifying user files are, of course, in ransomware. Such wholesale encryption of files is quite a different issue, but several malicious apps and PUPs have also corrupted user files, and may do so unintentionally.

Overall, files kept on recent storage systems in modern computers are still prone to damage and corruption, although they should be less of a problem than they have been in the past.

Storage manufacturers now try to reduce the chances of files becoming corrupted or damaged, for instance using error-correcting codes (ECC) in their products. There’s a conflict here in that ECC requires additional storage, effectively reducing that available to the user, and increases its cost per GB. Storage is a price-sensitive market, and few purchasers are prepared to pay 25% more or get 25% less capacity just to have good ECC cover. Its benefits are also not readily visible to the user, while the additional processing required during writing can impair performance.
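
To make the storage cost of ECC concrete, here is a minimal sketch of a Hamming(7,4) code in Python, which stores three parity bits alongside every four data bits and can correct any single flipped bit. It's used here only because it's small enough to read in full: the codes built into real storage are far more elaborate and need much less overhead than this. The function names are illustrative.

```python
def hamming74_encode(data):
    """Encode four data bits as a seven-bit codeword (three parity bits)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Return (data bits, corrected position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome names the faulty bit
    if position:
        c[position - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                        # simulate a single-bit error in storage
print(hamming74_decode(stored))       # prints ([1, 0, 1, 1], 5)
```

Seven bits are stored for every four of data, so the protection isn't free, which is the trade-off between capacity and cover described above.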

RAID systems are widely used to safeguard data integrity. The most fault-tolerant, level 6, usually uses ECC, but is far from efficient: four 1 TB disks used at this level only provide a total 2 TB of effective storage capacity, making it particularly expensive when implemented using SSDs. Write performance is also significantly slowed, even when implemented in hardware.
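
The capacity figure above follows from RAID 6 setting aside the equivalent of two disks for redundancy. A minimal sketch of that arithmetic, assuming equal-sized disks and ignoring any formatting overhead (the function names are illustrative):

```python
def raid6_usable_tb(disk_count, disk_size_tb):
    """Usable capacity of a RAID 6 set built from equal-sized disks."""
    if disk_count < 4:
        raise ValueError("RAID 6 needs at least four disks")
    # Two disks' worth of space holds redundancy rather than user data.
    return (disk_count - 2) * disk_size_tb


def raid6_efficiency(disk_count):
    """Fraction of the raw capacity that remains available to the user."""
    return (disk_count - 2) / disk_count


print(raid6_usable_tb(4, 1.0))   # prints 2.0
print(raid6_efficiency(4))       # prints 0.5
print(raid6_efficiency(8))       # prints 0.75
```

Efficiency improves as disks are added, but a four-disk set gives up half its raw capacity.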

Error-correction can also be incorporated into the file system, as is the case with Btrfs (Linux) and ZFS (cross-platform). This involves a process of ‘data scrubbing’ which scans the file system detecting errors and trying to repair them. Although OpenZFS is available for macOS and compatible with Catalina, installation and use are non-trivial and only feasible for advanced users.
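
The idea behind a scrub can be sketched in a few lines of Python. This is a conceptual illustration only, not how Btrfs or ZFS are implemented: it assumes two mirrored copies of each block and a checksum recorded when the block was written, and all names are hypothetical.

```python
import hashlib


def checksum(block):
    """Digest recorded alongside each block when it is written."""
    return hashlib.sha256(block).digest()


def scrub(copy_a, copy_b, stored):
    """Check every block in two mirrored copies against its stored checksum.

    A block that fails is overwritten from the other copy if that one still
    verifies. Returns (block index, outcome) for each block needing attention.
    """
    report = []
    for i, expected in enumerate(stored):
        a_ok = checksum(copy_a[i]) == expected
        b_ok = checksum(copy_b[i]) == expected
        if a_ok and b_ok:
            continue
        if a_ok:
            copy_b[i] = copy_a[i]
            report.append((i, "repaired copy B"))
        elif b_ok:
            copy_a[i] = copy_b[i]
            report.append((i, "repaired copy A"))
        else:
            report.append((i, "unrecoverable"))
    return report


def demo():
    blocks = [b"alpha", b"bravo", b"charlie"]
    stored = [checksum(b) for b in blocks]
    copy_a, copy_b = list(blocks), list(blocks)
    copy_a[1] = b"brav0"        # one copy silently damaged
    copy_a[2] = b"charlje"      # both copies damaged
    copy_b[2] = b"chbrlie"
    return scrub(copy_a, copy_b, stored)


print(demo())
```

Without a second copy or other redundancy, a checksum can only report damage, not undo it.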

Neither HFS+ nor APFS attempts any form of error correction on stored data, nor has Apple announced any intention that APFS will ever do so.

If error correction isn’t readily available in macOS, the next best thing is to be able to check the integrity of important files. This should enable you to replace a damaged copy of a file from backup or archive.

Alternatives to testing integrity aren’t particularly helpful. You could, for example, check the file modification date of all important files, assuming that you know for each file exactly what that should be. In any case, that will only reveal files which have been modified through the file system. Bit rot and similar damage doesn’t alter the modification date, just the data. For some, opening and checking the document in its normal editor/viewer is sufficient validation, but that’s only true if you can compare the current version against an earlier copy.
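
A short Python sketch shows why the modification date is no substitute for checking the data itself. It simulates damage by flipping a single bit and then restoring the original timestamps, standing in for corruption that occurs beneath the file system; the function names and the temporary test file are purely illustrative.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Digest of a file's data, read in chunks so large files are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def simulate_bit_rot(path, offset=0):
    """Flip one bit, then put the timestamps back as they were."""
    st = os.stat(path)
    with open(path, "r+b") as f:
        f.seek(offset)
        byte = f.read(1)
        f.seek(offset)
        f.write(bytes([byte[0] ^ 0x01]))
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))


def demo():
    """Return (modification date unchanged, digest changed)."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "document.txt")
        with open(path, "wb") as f:
            f.write(b"The contents of an important file.\n")
        digest_before = sha256_of(path)
        mtime_before = os.stat(path).st_mtime_ns
        simulate_bit_rot(path)
        mtime_unchanged = os.stat(path).st_mtime_ns == mtime_before
        digest_changed = sha256_of(path) != digest_before
        return mtime_unchanged, digest_changed


print(demo())
```

The date says nothing has happened, while the digest of the data no longer matches the one recorded earlier.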

Neither HFS+ nor APFS performs any integrity checking on the data stored in regular files, nor has Apple announced any intention that APFS will ever do so. APFS does currently perform limited integrity checking on certain of the file system metadata, but that’s as far as it goes. This may seem a major shortcoming in a new file system, but you’ve got to remember that APFS, unlike ZFS, isn’t primarily designed for large server systems, but has to scale down to the Apple Watch and Apple TV: you’d hardly want your Watch to stop telling the time for an hour while it performs a full scrub.

If error-correction isn’t feasible for most Mac users, and there’s no system support for integrity checking of files, the only way to address these issues is using third-party products. That’s what I’ll be looking at in the next article in this series, and where my utility Dintch comes in.

17 Comments


1

Duncan on April 3, 2020 at 1:05 pm

From the article:

“few purchasers are prepared to pay 25% more or get 25% less capacity just to have good ECC cover. Its benefits are also not readily visible to the user, while the additional processing required during writing can impair performance.”

While there is certainly a cost for implementing ECC, there is little cost for a file system provider or end user to build in the *provisions* for doing so. Apple builds in RAID capabilities (albeit only for RAIDs 0, 1, and 10) and yet just because very few users buy the extra drives and configure it doesn’t mean that the capability should thus be ripped out. As an analogy, automobiles come with storage trunks (boots) that in many cases are left empty while people drive around, because carrying extra material increases fuel costs, and yet they are still available when desired. It would be ludicrous to suggest that this added storage capacity should not be part of the car’s design simply because it doesn’t get used most of the time.

Some of us are willing to buy extra storage to gain the benefits that RAID offers, and storage costs (even for SSDs) have dropped in price such that what was prohibitively expensive ten years ago can now be picked up almost as an impulse buy. So if ECC was built into APFS I’m certain many people would pay the extra costs on their own to gain that functionality. Perhaps it’s only a small percentage of Apple’s customers now (similar to the percentage who buy RAID setups) but *in the future* the cost for doing so might be so low it could even be implemented seamlessly on iOS devices. (For perspective, keep in mind that somewhere around the iPhone 7 its computational capability matched that of a CRAY-1; they are by no means feeble devices.)

(I’ll split my reply across two comments to keep this from getting too long.)

2

Duncan on April 3, 2020 at 1:29 pm

Regarding the computational cost of implementing ECC, or even just file checksums, I find it hard to believe that this would be much of a challenge if done in hardware. We’ve got a T2 chip that can simultaneously encrypt and decrypt every piece of data passing through storage, while also performing certain media decompression routines. Current CPUs can perform RAID calculations (via SoftRAID) without the aid of dedicated hardware and that doesn’t bring even decade-old Macs to their knees while doing so. (I run it on a 2011 iMac file server with no problems.) So having a hardware solution for checksums and ECC calculations sounds like a no-brainer, if you’ll pardon the pun.

I’m reminded of the scene in ‘Dr. Strangelove’ where an exasperated Group Captain Mandrake is instructing Colonel Bat Guano to open the lock on the Coke machine to get coins for the pay telephone, and he exclaims, “Shoot it off! Shoot… with the gun! That’s what the bullets are for!” Well here we are discussing a strictly-computational task for the sake of guarding our data, and if a ‘computer’ isn’t specifically suited for that task I don’t know what is. That’s what the processors are for!

But again, just like the cost of additional storage to implement RAID, ECC features could be entirely *optional* so that lesser devices don’t have to do it if not desired. You’re correct that an Apple Watch owner has very little reason to concern themselves over their stored data, but if APFS had built in the capability but left it dormant on less-powerful devices I don’t think anyone would complain. The problem is that APFS *doesn’t* have this capability at all (that we know of), so if a cheap hardware solution arrives in the future APFS will either have to be revised (with possibly backward-compatibility problems) or this functionality will still need to be provided by third parties because of Apple’s short-sightedness.

3

Duncan on April 3, 2020 at 3:09 pm

Upon reviewing my above comments, I don’t wish for them to be perceived as an argument with Howard, or to expect any sort of rationalization on Apple’s behalf. I posted that out of frustration over the limitations of APFS, a brand new, from the ground up file system that I have been awaiting for over a decade. My apologies if my passion got the better of me.

I look forward to Howard’s series of articles and will try to refrain from long-winded rants.

- 4
  
  hoakley on April 3, 2020 at 3:34 pm
  
  Duncan,
  Thank you – I appreciate the spirit in which you comment. But we must always remember that the primary requirements for APFS are those of the vast majority of its users, who are on iOS. It’s not that Apple tacked on little bits to support macOS, but that APFS is primarily a file system for mobile devices with small memory and limited storage. ZFS was designed primarily for servers and networked systems, and is complicated for the smaller system user. I haven’t, for instance, seen any fully-GUI front end for its management, which would surely be mandatory for macOS.
  Another important factor to bear in mind is that Apple could do whatever it wanted (within reason) with the T2 chip and its successors, but they only act as disk controllers for internal storage, and there only with SSDs. Apple and many other vendors treat SSDs as if they never have errors or file system corruption, which is overoptimistic of course. But the need for ECC and/or integrity checking is more with external storage, which includes hard disks above all, and isn’t something which the T2 chip can help with.
  Hardware RAID systems still aren’t cheap, and those using SSDs remain relatively expensive. My own very recent experience is that even a software RAID system will cost around £/€/$ 1500 for 4 x 2 TB SSD, which is around the same price as many new Macs. I don’t recall seeing a TB3 hardware RAID enclosure which offers additional ECC at any price, either.
  Maybe in the future, this will all change, and Apple may even introduce ECC into APFS. But what I’m more concerned with here is what we can do today. Just at the moment, that seems quite enough!
  I join you in the campaign to introduce these features in a future version of APFS. I do think they’re important, and the file system is where they need to be implemented in the first instance.
  Howard.
  
  - 5
    
    Duncan on April 3, 2020 at 6:15 pm
    
    Thank you for your reply.
    
    I guess to summarize my lament I wish that Apple had made APFS ‘ECC-ready’, or at least ‘checksum-ready’ with a built-in spot for the checksum data. That way you and other developers wouldn’t need to jump through hoops to implement the xattrs or .ic file hacks to manage that.
    
    I’m also curious how CCC implements the ‘Find and replace corrupted files’ option. I believe that involves checksums as well, and certainly adds a lot more time to a clone operation (I have a separate task that performs that once a week early on Sunday morning). Does it perform the hash on every file each time it’s run, and then throw that data away afterwards? Ideally it would store the checksums for unchanged files (which probably constitutes the bulk of anyone’s main storage) somewhere on the clone so it can reuse it as appropriate.
    
    (I asked that a while ago from Bombich Software but never got an answer. I don’t know if you happen to have any more insights into that.)
    
    - 6
      
      hoakley on April 3, 2020 at 7:03 pm
      
      Sorry, I don’t know how CCC does that. Maybe Mike Bombich will respond if he is passing?
      Howard.
      
  - 7
    
    Sebastian on April 5, 2020 at 2:01 pm
    
    (I hope this comment shows up in the right place; on the website, the “Reply” link only appears for top-level comments.)
    
    The details of CCC’s “Backup Health Check” mechanism are explained in its online documentation (search for “corrupted” on that page and you’ll jump to the pertinent section). It is pretty much as Duncan suspects:
    
    “With this option, CCC will calculate an MD5 checksum of every file on the source and every corresponding file on the destination. If the checksums differ, CCC will recopy the file.”
    
    That is probably the least elegant and least efficient way to verify backup integrity and it doesn’t cover source file integrity at all (save for outright read errors), but I guess it’s the only method you can sensibly incorporate into the backup process. Anything more would require writing to either the source or the backup in order to store hashes permanently, and that would be a big no-no for a backup tool, the purpose of which is, after all, to take the source exactly as it is and (ideally) construct a perfect copy of it. Unnecessarily adding data or even introducing asymmetries between backup and original would quite simply mean failure as a backup tool.
    
    Also, I presume that anything more sophisticated would basically mean turning CCC into a fully-fledged file integrity checker and require a rethinking of many of its core operational principles. And while time-consuming and not technically optimal, this option should nevertheless cover most instances of data corruption on the backup. So for users who don’t want to dive into specialized integrity checking tools, using this “health check” seems perfectly sensible (at least when only invoked weekly or monthly, as Mike recommends).
    
    - 8
      
      hoakley on April 5, 2020 at 4:38 pm
      
      Thanks, Sebastian.
      (Sorry – there’s a limit to the depth of replies.)
      Howard.
      
9

Xavier Paredes on April 3, 2020 at 11:53 pm

You, dear sir, are one of the best writers—if not the best—I’ve ever come across. You can make a complicated subject easy to understand.

Please keep up the amazing work and thank you for sharing your knowledge.

Cheers from Houston.

- 10
  
  hoakley on April 4, 2020 at 5:51 am
  
  Thank you for your kind words.
  Howard.
  
11

tempelmann@gmail.com on April 6, 2020 at 12:31 pm

BTW, HDs, and I suspect SSDs as well, internally have ECC (or something similar) for dealing with bit errors, so they already have their own error correction layer (and then relocate the data to one of a reserved set of “backup” sectors).

However, the problem is that they do not communicate this fact, i.e. that a certain sector has some problems that were corrected, to the operating system (OS). That’s probably because the (SCSI, ATAPI) protocol does not support this. If it did, it would allow the OS to take action.

But then again, macOS doesn’t even let the user know if the disk had encountered some hard read errors! Also, no warning if the SMART check reports serious issues (including a check that the number of reserved backup sectors is declining significantly). In this regard, Apple has always been very ignorant of potential disk failures. Even the new APFS format has not taken care of adding redundant information – all it could tell you is that your data is corrupt. I think Apple wants you to rely on backup disks solely, despite knowing that many people, especially casual users, don’t ever take care of that.

In case of a known or imminent disk failure, macOS could at least tell people that their disk starts to go bad, so that they can act accordingly (instead, if the Mac stops working or keeps losing files, someone with more of a clue might suggest to them to run First Aid – and only then they’ll know what’s happened). I guess marketing (and product support?) doesn’t like the idea of scaring people of potential issues.

Funny anecdote: About 20 years ago, during a Q&A with Apple engineers at a WWDC, I did suggest that Mac OS (before OS X) should suggest to users to better restart their Mac when a program had crashed, because any program back then could corrupt the entire memory, including the operating system’s. So, to avoid further damage, a reboot would be the safest way to keep the Mac stable. Engineers agreed and added this warning to later Mac OS versions indeed! Much later I read an article where it was said that Steve Jobs hated that message (meaning he made an explicit point about it). Go figure.

And just the other day, when I wanted to help a friend to upgrade her MacBook’s HD to an SSD, it turned out that the HD had some read errors (which, of course, prevented me from easily cloning the disk to the SSD, but I managed) – and there had been zero indication of that before (to her, the Mac sometimes got stuck – that was all she noticed, and that’s why she thought she needed a new computer; because this one “got too slow”).

So yes, re-scanning your files occasionally may indeed make sense, to make sure it’s all still intact. But that’s just a kludge to deal with the carelessness in macOS that Apple has left us with. And my friend, being a rather uninformed Mac user, would never install such a useful tool because she’d never learn about it.

Still, a few more suggestions for you, Howard:
1. See if you can use Spotlight to quickly find all files or folders that you’ve tagged, or keep a separate “database” in which you remember them.
2. Have a mode or a button that automatically scans all those files, listing the ones that got modified without their modification date having changed – because those are the suspicious ones!
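
The second suggestion could be sketched roughly as follows in Python. This is only an illustration of the logic, under the assumption that a digest and modification date were recorded for each file at some earlier point; it keeps them in a plain dictionary rather than in extended attributes, and every name in it is hypothetical.

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 digest of a file's data, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()


def record(paths):
    """Build the 'database': path -> (digest, modification time in ns)."""
    return {p: (file_digest(p), os.stat(p).st_mtime_ns) for p in paths}


def classify(manifest):
    """Sort recorded files into intact, edited, suspicious and missing."""
    result = {"intact": [], "edited": [], "suspicious": [], "missing": []}
    for path, (old_digest, old_mtime) in manifest.items():
        if not os.path.exists(path):
            result["missing"].append(path)
            continue
        same_data = file_digest(path) == old_digest
        same_date = os.stat(path).st_mtime_ns == old_mtime
        if same_data:
            result["intact"].append(path)       # data unchanged, whatever the date
        elif same_date:
            result["suspicious"].append(path)   # data changed, date did not
        else:
            result["edited"].append(path)       # data and date both changed
    return result
```

Files in the suspicious list are those whose data changed without the file system recording a modification, so they're the ones to restore from backup first.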

(Huh, I should probably turn this into my own blog post)

- 12
  
  hoakley on April 6, 2020 at 12:45 pm
  
  Thanks, Thomas.
  Yes, I think there are several ways in which integrity checks need to be presented. However, I want to start at the beginning and not rush too far ahead just yet.
  I’m still looking at whether it might be feasible, for instance, to add on ECC without going down to the file system. Yes, as you write, it is allegedly built into a lot of storage now. But is it, for instance, in iCloud? The more comprehensive the solution, the more useful it will be.
  Howard.
  
- 13
  
  Joss on April 8, 2020 at 10:51 am
  
  As for (1): if I remember correctly, there will eventually be a dintch CLI, and then you can create a small shell script that writes the dintch XA to a file, plus a com.apple.metadata:_kMDItemUserTags XA with e.g. “Dintch” in the XML. Then you can list all of the dintched files with mdfind, or in the GUI with Spotlight or in the Finder sidebar. Though I wonder if you can add the “#S” suffix to a user tag, i.e. write the additional XA as “com.apple.metadata:_kMDItemUserTags#S”.
  
  As for (2): that would mean that the modification date of the file needs to be written to an XA, too. Then Dintch could compare the current mod date with the XA-stored mod date, and if they match, but the hash is different, it would be a suspicious file. If the mod date is written to the Dintch XA, then it would be a feature update for the Dintch software, but you could also write it to a third XA, e.g. a proprietary plain text XA, e.g. “co.eclecticlight.dintch.lastModified#S”. Then you could compare the mod dates in your own shell script.
  
  - 14
    
    hoakley on April 8, 2020 at 10:57 am
    
    Thank you, Joss.
    As of this morning, I have on the stocks for development from next week (I have an article for MacFormat to write first):
    – Dintch 1.1 which can optionally write the timestamp of the digest as another xattr
    – Fintch which will be a drag-and-drop utility to tag individual files rather than whole folders
    – cintch the command tool; I haven’t yet worked out its man page or options though
    – and I’m now looking at implementing full ECC for files as a separate project.
    I hope you like the names!
    Howard.
    
  - 15
    
    hoakley on April 8, 2020 at 10:59 am
    
    Oh – regarding using #S flags on standard xattrs, yes you can, but they don’t always override the system defaults, so I need to play around with them first. Sometimes adding a flag seems to do the opposite, which is confusing to say the least.
    Howard.
    
  - 16
    
    Joss on April 8, 2020 at 11:16 am
    
    Thank you for the info; nice to see new functionality on the horizon.
    
    Question: wouldn’t it be easier to write it all into a single XA, as e.g. macOS does with com.apple.quarantine? The XA contents would then be (for example) “$hash;$process;$date”, with hash being the file hash, process being Dintch || Fintch || cintch, and date being the posix date of either last modified or last digest, depending on what you plan to use (either in plain text or hexed). You would then only need to read one XA, and parse it within Dintch/Fintch/cintch, and since the hash is always in first place, it would also be suited for situations where the user doesn’t choose to write the timestamp, i.e. if you parse column 3, the date would be (null), but it would still be fine. (But I can’t say which is faster: reading from two XAs, or reading from one XA and parsing it; I assume that writing one XA is faster than writing two.)
    
    Another question: is there a reason why Dintch doesn’t write its XA in plaintext?
    
    - 17
      
      hoakley on April 8, 2020 at 11:59 am
      
      Speed and compactness are the primary reasons for not doing this.
      If each time you read or write a xattr, you have to convert to/from a UTF-8 string and parse the contents, tagging or checking a million files is going to take a long time. As the timestamp is an option, it’s good to put that in a separate text xattr, which will then make tagging and checking a tad slower. But to inflict that same slowdown on all users isn’t good.
      The digest returns from Common Crypto or CryptoKit as 32 bytes of binary. That’s exactly what’s stored in the xattr without any further processing. Turn it into UTF-8 and it takes significantly longer, and requires more storage space. It’s also easier to tamper with or replace.
      Howard.
      

Comments are closed.
