Explainer: Unicode, normalization and APFS

One of the oldest problems with Apple’s APFS file system is how it encodes file and directory names using Unicode. This collides with one of the thorniest problems with Unicode: characters which appear identical can be represented by two or more different sequences of code points (encodings). To see this in action, you’ll need a copy of my free app Apfelstrudel.

Open Apfelstrudel, place the cursor in its Input box and type on your keyboard the word Café, with an acute accent on the final e (press Option-e then the e key again to generate that). Then press Return.

In the next box down from Input, labelled HFS+, you’ll see that box highlighted in red, with the word Café repeated in it. This is because, although those two renderings of the word appear identical, they use different Unicode code points: the é you typed is a single code point (U+00E9), while the HFS+ version is a plain e followed by a combining acute accent (U+0301). The version you typed in is in ‘normalised Form C’ (NFC), while that used in the old HFS+ file system would be in ‘normalised Form D’ (NFD).
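
To see the same difference outside Apfelstrudel, here is a minimal sketch in Python, using only its standard unicodedata module. The variable and function names are mine, chosen for illustration, and it assumes that the keyboard produces the precomposed é.

```python
import unicodedata

# Assumption: the keyboard produces the precomposed é (U+00E9).
typed = "Caf\u00e9"

form_c = unicodedata.normalize("NFC", typed)  # composed form
form_d = unicodedata.normalize("NFD", typed)  # decomposed form


def code_points(text):
    """List the code points in text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(form_c))  # ['U+0043', 'U+0061', 'U+0066', 'U+00E9']
print(code_points(form_d))  # ['U+0043', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
print(len(form_c), len(form_d))  # 4 5
print(len(form_c.encode("utf-8")), len(form_d.encode("utf-8")))  # 5 6
print(form_c == form_d)  # False
```

The two strings look the same when displayed, yet differ in both their code points and their lengths.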

Now try this simple test. Create a new folder and paste the Input version of Café as its name. Then create another new folder alongside it (in the same folder) and try pasting the HFS+ version as its name. The Finder won’t let you, as it considers that those two names are identical, just as they appear to be, even though they contain different Unicode code points, and even their lengths differ.

This happens in HFS+, which is a ‘normalising file system’, to avoid you becoming confused by two items which appear to have identical names. If you try to name an item using Form C, HFS+ automatically converts it to Form D. So although you may have provided two different names, HFS+ normalises them both to Form D, in which they really are identical.
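
That behaviour can be modelled in a few lines of Python. This is a toy sketch, not how HFS+ is implemented: the class and method names are hypothetical, and it uses standard Form D from Python’s unicodedata module, whereas the rules the real file system applies may differ in detail.

```python
import unicodedata


class NormalisingDirectory:
    """Toy model of a directory that stores every name in Form D."""

    def __init__(self):
        self._names = set()

    def create(self, name):
        # Normalise before storing, so Form C and Form D inputs converge.
        stored = unicodedata.normalize("NFD", name)
        if stored in self._names:
            raise FileExistsError(f"{name!r} already exists")
        self._names.add(stored)
        return stored


folder = NormalisingDirectory()
folder.create("Caf\u00e9")  # Form C, stored as Form D

try:
    folder.create("Cafe\u0301")  # Form D, the same name once normalised
    clash = False
except FileExistsError:
    clash = True

print(clash)  # True
```

Because both names are reduced to the same stored form, the second attempt is refused, just as the Finder refuses the second folder.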

Normalising file systems used to be quite popular, but by the time Apple came to design APFS, such behaviour was falling out of favour. In its original specification, APFS was billed as being non-normalising (the original phrase referred to filenames as just being a ‘bag of bytes’). That attracted a lot of criticism from macOS developers, who could see the problems it would bring to users and apps alike. In spite of that, Apple’s engineers stuck to their guns.

When APFS was first released, chaos ensued. Languages based on the Roman alphabet were less affected; worst hit was Korean, where far more characters have different normalised forms. Among the biggest casualties were Apple’s own apps, so its engineers worked on a fix. Still confident that APFS had made the right choice, they didn’t make it a normalising file system, but applied normalisation wherever it’s needed in macOS. That’s what you see going on in my little demonstration above, and why you can’t have items using both normalised forms in the same folder.

For the great majority of users, this works fine. APFS doesn’t normalise, which makes it a bit more efficient, but a normalisation layer in macOS ensures that file and directory names are normalised, so APFS behaves just like HFS+ did. Except that it doesn’t always.

There are two potential problems which can appear, apparently out of the blue.

The first is with apps which generate their own file and folder names from un-normalised Unicode text. Let’s say I have an app which creates and maintains its own image library, based on metadata stored with each image. If I as a user save a metadata field for an image using the Form C version of Café, which is the more likely as that’s what’s generated from the keyboard, and the app tries to use that as the filename, macOS should normalise that to Form D. If that app is unaware of the normalisation, the metadata and the filename end up containing different code points, and a direct comparison between them can fail. Developers need to be aware of this and track the file path using the correct form, or even better use more independent mechanisms such as bookmarks.
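
One defensive approach, sketched here in Python on the assumption that the app already has a list of the names actually on disk, is to match names after normalising both sides rather than comparing them directly. The function name find_on_disk and the sample names are hypothetical.

```python
import unicodedata


def find_on_disk(wanted, names_on_disk):
    """Return the stored name canonically equivalent to wanted, or None."""
    target = unicodedata.normalize("NFD", wanted)
    for name in names_on_disk:
        if unicodedata.normalize("NFD", name) == target:
            return name  # use the name exactly as stored for the file path
    return None


metadata_title = "Caf\u00e9.jpg"  # Form C, as typed by the user
names_on_disk = ["Cafe\u0301.jpg", "Tea.jpg"]  # first name is in Form D

print(metadata_title in names_on_disk)  # False: a direct comparison misses it
match = find_on_disk(metadata_title, names_on_disk)
print(match == names_on_disk[0])  # True
```

In a real app the list might come from a directory listing, and it’s the returned name, not the metadata string, that should be used when building the path.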

More likely and more serious are the conflicts which can occur when using different methods of accessing non-Mac file systems. Thomas Tempelmann has demonstrated this using a share on a NAS: he first mounted it using NFS and created a file with a name in Form C. When the same share was mounted via SMB on a Mac, that file couldn’t be accessed, as its name was un-normalised. Apple’s recommended solution is to mount NFS shares with the nfc option enabled, which should ensure that names are normalised consistently. As ever, Michael Tsai has a succinct summary.

There are all sorts of other ways that Unicode normalisation can trip apps up. Apfelstrudel shows some of them in its lower text view: Form C and D strings should compare correctly using Swift == and NSString compare() when caseInsensitive, but not with NSString isEqual() comparison.
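
The same trap exists in other languages. Python’s == on strings, for instance, compares code points literally, so it treats the two forms as different, much like the isEqual() case above. Here is a minimal sketch of comparisons that do respect canonical equivalence; the helper names are my own, and this isn’t what Apfelstrudel does internally.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after bringing both to Form D."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def canonically_equal_ignoring_case(a, b):
    """Caseless comparison: normalise, case-fold, then normalise again."""

    def key(text):
        folded = unicodedata.normalize("NFD", text).casefold()
        return unicodedata.normalize("NFD", folded)

    return key(a) == key(b)


form_c = "Caf\u00e9"   # precomposed é
form_d = "Cafe\u0301"  # e followed by a combining acute accent

print(form_c == form_d)                                      # False
print(canonically_equal(form_c, form_d))                     # True
print(canonically_equal_ignoring_case("CAF\u00c9", form_d))  # True
```

Normalising only at the point of comparison leaves the original strings untouched, which matters if they’re also used as file paths.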

All this could have been so much simpler if there were only one form of normalisation, or if visually identical characters had but a single Unicode code point. Until then, be aware that every now and then normalisation problems can appear out of the blue and cause strange errors. To make this a little easier to handle, I will shortly be building the features in Apfelstrudel into Mints, so they’re more accessible and easier to understand.