hoakley August 4, 2018 Macs, Technology

How to check that a file really is a faithful copy: Take 2, archives

A week ago, I explained how to compare two files, to see if they were in every respect the same, in particular whether their extended attributes were identical. Although that article stands, I didn’t consider a special situation: when the files in question are archives, compressed or not, which contain other files. Just for good measure, they could be encrypted too, if you want.

To understand how to tackle this, you need to understand what happens when an archive is made.

Normal ‘native’ files consist of three main types of data:

the attributes, including basic information about the file such as its name, date of creation, and other admin information about it;
the extended attributes, or xattrs, which are not stored with the file, but in the file system metadata of that volume;
the data (‘data fork’), which is the content that you normally get to see, such as the text content of a text file.

When they are turned into an archive, which might be compressed and/or encrypted, those three parts are flattened together. The attributes are mostly put into a directory section of the archive file, and the extended attributes and data are then put together into the data (‘data fork’) of the archive file.

I don’t know of any archive format which keeps the extended attributes as xattrs. Flattening them into the archive’s data makes it simple to preserve the extended attributes even when the archive may be stored on a file system or storage which doesn’t understand xattrs at all. Only when you unarchive the file is the data stored from the file’s xattrs turned back into extended attributes, and tucked away in the file system metadata.

Given that, how can we reliably compare two archives, to tell whether their contents are identical?

The processes of archiving and unarchiving involve one-to-one mappings. Using the same method of archiving, there is only one archive which can be made from each file, and each archived file can only unarchive into one file. Provided that the archives were built in an identical manner, comparing them using a binary comparison tool such as cmp will reveal whether their contents are identical. The first and simplest test is therefore to compare the data forks of the two archives:
cmp archive1.sitx archive2.sitx
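
In a script, the useful part of cmp is its exit status: 0 when the files are identical, 1 when they differ, and greater than 1 if something went wrong. A minimal sketch of how that might be wrapped; the function names same_data and report are made up for illustration:

```shell
#!/bin/sh
# same_data: hypothetical helper. Succeeds only when the two files'
# data forks are byte-for-byte identical.
same_data() {
    # -s silences cmp's output, leaving only the exit status:
    # 0 = identical, 1 = different, >1 = trouble (e.g. a missing file).
    cmp -s "$1" "$2"
}

# report: hypothetical helper. Prints a one-line verdict for a pair of files.
report() {
    if same_data "$1" "$2"; then
        echo "identical"
    else
        echo "different, or could not be read"
    fi
}
```

Calling report archive1.sitx archive2.sitx then prints a single verdict rather than the offset of the first differing byte.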

The snag with this is that it is usually possible to build two or more archives which contain exactly the same files, only in a different order. Although those would then actually contain identical files, simple comparison of the archive files would incorrectly report them as being different. This could also occur if the same files were archived using different versions of the archiving tool, which write the archive using a slightly different structure.

The next step, then, if the initial comparison shows that their data forks differ, is to unarchive their contents into temporary folders, and compare those contents one file at a time. If you want to know whether the archives are truly identical, an easy approach is to step through each file in archive A, check whether there is an identically-named item in archive B, then compare those two with respect to their data and xattr contents.

Keep a tally of the number of files checked in archive A, and at the end ensure that it matches the total number of files in archive B. If you don’t do that, you won’t discover when archive B contains files which are not in archive A at all.
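
The walk and tally described above might be sketched like this, assuming both archives have already been unarchived into two folders. The function name compare_trees is hypothetical. Because the command syntax of cmpxat isn't given here, the sketch falls back on the bundled xattr tool (xattr -l) where it is available, which is an assumption rather than the method used by cmpxat. File names containing newlines are not handled.

```shell
#!/bin/sh
# compare_trees DIR_A DIR_B   (hypothetical helper)
# Succeeds when every file under DIR_A has an identically named twin
# under DIR_B with the same data, and DIR_B holds no extra files.
compare_trees() {
    dir_a=$1
    dir_b=$2
    list=$(mktemp) || return 2
    status=0

    # Step through each file from archive A, by path relative to DIR_A.
    (cd "$dir_a" && find . -type f) > "$list"
    while IFS= read -r rel; do
        if [ ! -f "$dir_b/$rel" ]; then
            echo "missing from B: $rel"
            status=1
        elif ! cmp -s "$dir_a/$rel" "$dir_b/$rel"; then
            echo "data differs: $rel"
            status=1
        # Assumption: the textual listing from 'xattr -l' is a good enough
        # stand-in for a proper xattr comparison. Skipped if xattr is absent.
        elif command -v xattr >/dev/null 2>&1 &&
             [ "$(xattr -l "$dir_a/$rel" 2>/dev/null)" != \
               "$(xattr -l "$dir_b/$rel" 2>/dev/null)" ]; then
            echo "xattrs differ: $rel"
            status=1
        fi
    done < "$list"

    # Tally: the counts must match, or B contains files that A lacks.
    count_a=$(( $(wc -l < "$list") ))
    count_b=$(( $( (cd "$dir_b" && find . -type f) | wc -l) ))
    rm -f "$list"
    if [ "$count_a" -ne "$count_b" ]; then
        echo "file counts differ: A has $count_a, B has $count_b"
        status=1
    fi
    return "$status"
}
```

The file list is written to a temporary file rather than piped into the loop, so that the status set inside the loop survives to the end of the function.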

I can think of three additional twists which can make this more complex still.

Do you want to know details of the differences, or is it sufficient to know whether they are identical, or different? Working through detailed comparisons to detect and report all the differences can be quite demanding, and you could end up with a report listing thousands of differences, which gets messy to handle.

How close do you want these files to match? You may want to build in tolerance for differences in attributes, such as different creation/modification dates, or even different names, provided that the data forks and xattrs match identically. Flexibility here comes at the cost of complexity.

Some archive formats store files using a standard flattened format, so you could simply extract that flattened format (in which xattrs are effectively part of the file’s data) and compare that as a chunk of data. That could be much quicker than unarchiving both the archives, and would probably be the preferred ‘professional’ solution.

If you simply want to know whether the archives contain identical files or not, it shouldn’t be hard to write an Automator workflow, AppleScript, or shell script based on cmp and cmpxat which will do this very well.
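
As a rough illustration of such a shell script, here is a two-stage yes/no check for Zip archives. It is only a sketch under several assumptions: the function name archives_match is invented, unzip stands in for whichever unarchiver suits the format, and diff -r compares data only. Since unzip doesn't turn stored xattrs back into extended attributes, a full check would need an unarchiver that restores them, followed by an xattr comparison with cmpxat or similar.

```shell
#!/bin/sh
# archives_match ARCHIVE_A ARCHIVE_B   (hypothetical helper)
# Exit status: 0 = same contents, 1 = different, 2 = could not unarchive.
archives_match() {
    # Stage 1: identical bytes mean identical contents.
    cmp -s "$1" "$2" && return 0

    # Stage 2: the archive files differ, but their contents may not
    # (different file order, different tool version), so unarchive both.
    work=$(mktemp -d) || return 2
    mkdir "$work/a" "$work/b"
    if unzip -q "$1" -d "$work/a" && unzip -q "$2" -d "$work/b"; then
        # diff -r walks both folders, comparing file data and flagging
        # items present on one side only; -q suppresses the details.
        if diff -rq "$work/a" "$work/b" >/dev/null; then
            result=0
        else
            result=1
        fi
    else
        result=2
    fi
    rm -rf "$work"
    return "$result"
}
```

Used as archives_match one.zip two.zip && echo "same contents", this answers only the identical-or-different question, and reports nothing about where any differences lie.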

8 Comments

1

EcleX on August 9, 2018 at 6:13 am

Thanks for the great article. Where does MD5 fit in there? I use File Compare to compare the MD5 of two files. How does cmpxat1 compare to File Compare (better or worse)?
https://www.macupdate.com/app/mac/21455/file-compare
http://www.softhing.com/filecompare/info.html

- 2
  
  hoakley on August 9, 2018 at 6:17 am
  
  File Compare only looks at the data part (‘data fork’), and ignores the extended attributes, as far as I know. cmpxat only looks at the extended attributes, not the data fork. cmpxat doesn’t use MD5; hashes of that kind are best suited to large data forks. Almost all xattrs are small in size, and there is no benefit to using MD5 in those circumstances. Instead, cmpxat actually compares the contents of the xattrs directly.
  I hope that is clear.
  Howard.
  
3

EcleX on August 9, 2018 at 2:43 pm

Thanks. What I am looking for is an application to determine whether two files (.docx, .pdf, etc.) are the same or not. For instance, when you receive one by email, and later on a similar message with the file again. Is that a new version of the file? The same as the one previously received? When for some reason I send a message twice (transmission error, new version, etc.), I specify that in the message (and try not to send attachments twice or more), but some people do not do it. So, you end up receiving what seems to be the same file repeated, but the question is whether it is really the same file or a new version, as said. That is the point.

- 4
  
  hoakley on August 9, 2018 at 2:52 pm
  
  The solution depends on which information in the files you want to compare.
  If you want to compare just the data forks, then the bundled command line tool cmp is ideal, and there are GUI apps which will do the same.
  If you want to compare the extended attributes, then cmpxat (command tool) or xattred (GUI) will do that.
  If you want to compare both, then use cmp and cmpxat, for example.
  Howard.
  
5

EcleX on August 9, 2018 at 9:43 pm

I just want to know if the contents of the file are different. I mean the contents created by the user. For instance, in a “.docx” or “.pdf” file, you create file 1. Then edit the file and create file 2. They will have different MD5s. That is what I want to know. As far as I know, there is no Mac GUI application onto which you can drag & drop several files (or a folder of files) and compare the MD5 of each against all the others to identify any identical ones. Such a tool would be great.

- 6
  
  hoakley on August 9, 2018 at 9:54 pm
  
  I thought that is what you told me that File Compare does: compare MD5s of the data fork.
  It’s not hard to write an Automator workflow or AppleScript tool to do this using cmp either, although I think that cmp only compares two files at a time.
  These two articles have been about comparing files which have extended attributes, and files when they are in a Zip or similar archive, which are completely different problems, and rather harder to solve, as I have explained above.
  Howard.
  
- 7
  
  JP on September 10, 2018 at 4:09 am
  
  Actually, there is!
  
  It’s called QuickHash GUI (https://quickhash-gui.org). I’ve been using it for a while now and find it very useful for confirming error-free transfers of files to and from various storage media. It will calculate MD5, SHA-1, SHA256, SHA512, and xxHash32 values for one or multiple files (you can export .csv lists of files and their hashes), and allows you to compare two files or two folders (which I find super useful), or even whole disks. And it has drag & drop functionality. 🙂
  
  It is open source and cross-platform. My only reservations in recommending the Mac version: The app is currently (as of ver. 3.0.2) still 32-bit, and installation can be a little tricky (it requires a Terminal command, but the installation instructions are clear) because the app is not code signed (however, you can get a signed version for a mere £1.99). Also, it can be a bit buggy. For example, it will crash if I uncheck “Log Results” when comparing two folders.
  
  Another quirk is that it will compare hidden files in folders (with no way of disabling this currently). So, for example, the ubiquitous .DS_Store files that macOS generates can trip it up when comparing otherwise identical folders. My solution is to delete them from each folder (briefly showing hidden files in Finder works: CMD + ⇧ SHIFT + . [period]) and re-scan. Just something to be aware of.
  
  I’m sure it will continue to improve over time. But it’s the only program of its kind on the Mac of which I’m aware, so why not give it a try!
  
  - 8
    
    hoakley on September 10, 2018 at 7:04 am
    
    Thank you.
    I have an app named Hash which seems to work rather better than that – it’s available in the App Store – but it also doesn’t fully solve the problems. There are also issues over selecting appropriate hashes/checksums: I think that the newer and larger SHA hashes are now recommended, but are not exactly convenient to use in these apps.
    Howard.
    
