Volume names are still a Unicode mess

There are many differences between Apple’s two file systems, HFS+ and APFS. Some of the messiest result from the design decision that APFS won’t normalise file and directory names, as HFS+ does. This brought the first versions of APFS to the brink of disaster, before Apple engineered a workaround. Even today, four years on, there remain problems which can catch users out: this article explains how normalisation still makes a mess of volume names in Big Sur 11.4.

The underlying problem is that there are many Unicode code points which represent what are visually identical characters. While CaféÅngstrom and CaféÅngstrom might look the same, they use different code points. Some file systems, such as HFS+, manage this by normalising all directory and file names to one of two forms, C or D. Others, like APFS, don’t themselves perform any normalisation, just preserve whatever characters they’re given, leaving it to the system to handle issues arising from normalisation.
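To make the difference concrete, here is a minimal Python sketch using the standard `unicodedata` module. The name is built from explicit escapes so that its form is unambiguous; the helper `code_points` is purely illustrative and isn't part of any Apple tool.

```python
import unicodedata

# Form C: é and Å are single precomposed code points.
name_c = "Caf\u00e9\u00c5ngstrom"
# Form D: each accented letter becomes a base letter plus a combining mark.
name_d = unicodedata.normalize("NFD", name_c)

def code_points(s):
    """Return the code points of s as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

print(name_c == name_d)          # False: the strings differ
print(len(name_c), len(name_d))  # 12 14
print(code_points(name_c))
print(code_points(name_d))
```

Both strings render identically, yet a plain comparison treats them as different names, which is exactly what a file system that doesn’t normalise will do.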

This article stems from Thomas Tempelmann’s (@tempelorg) observation on Twitter that, if you name a volume in Disk Utility, its name can remain in Unicode Normalisation Form C. That isn’t compatible with the rest of macOS, which expects Form D to be used. Both names look identical, and it takes an app like Mints or Apfelstrudel to make it clear that the two names actually use different Unicode code points (characters).
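If you don’t have one of those apps to hand, a rough check of which form a name is in can be sketched in Python. This assumes Python 3.8 or later for `unicodedata.is_normalized`; the function name `normal_form` and its return labels are my own, and this is not how Mints or Apfelstrudel work internally.

```python
import unicodedata

def normal_form(name):
    """Classify a name as 'C', 'D', 'unaffected' or 'mixed' (illustrative labels)."""
    is_c = unicodedata.is_normalized("NFC", name)
    is_d = unicodedata.is_normalized("NFD", name)
    if is_c and is_d:
        return "unaffected"   # e.g. plain ASCII, identical in both forms
    if is_c:
        return "C"
    if is_d:
        return "D"
    return "mixed"            # contains both composed and decomposed characters

print(normal_form("Caf\u00e9"))    # C
print(normal_form("Cafe\u0301"))   # D
print(normal_form("Cafe"))         # unaffected
```

Names consisting only of unaccented characters are the same in both forms, which is why this problem goes unnoticed with most English volume names.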


Name a volume in the Finder instead and it behaves correctly, normalising the name to Form D to prevent any such confusion. If two volumes then end up with the same name, paths are kept unique: one volume is known by its original name (using Form D), e.g. CaféÅngstrom, and the other with a 1 appended to the Form D name, e.g. CaféÅngstrom 1.
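That behaviour can be modelled in a few lines of Python. This is only a sketch of what the Finder appears to do from the outside, not Apple’s implementation: the function `finder_style_name` is hypothetical, and it assumes the set of existing names is already in Form D.

```python
import unicodedata

def finder_style_name(requested, existing):
    """Assumed model: normalise to Form D, then number the name if it's taken.

    'existing' is a set of names already in use, assumed to be in Form D.
    """
    base = unicodedata.normalize("NFD", requested)
    if base not in existing:
        return base
    n = 1
    while f"{base} {n}" in existing:
        n += 1
    return f"{base} {n}"

first = finder_style_name("Caf\u00e9\u00c5ngstrom", set())
second = finder_style_name("Caf\u00e9\u00c5ngstrom", {first})
print(second == first + " 1")   # True
```

Because comparison happens after normalisation, it doesn’t matter which form the name was typed in: the clash is always detected.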


If you do use Disk Utility to create two volumes with what appear to be identical names, but actually differ in their normalisation, then behaviours become stranger still. My example here uses CaféÅngstrom (Form C) and CaféÅngstrom (Form D).
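One way to spot such pairs is to group names by their Form D equivalent, as in this Python sketch. The function `lookalike_groups` and the sample list are illustrative assumptions. The names you get by listing mounted volumes may not match the stored volume names, so the list here is supplied by hand.

```python
import unicodedata
from collections import defaultdict

def lookalike_groups(names):
    """Return lists of names that become identical once normalised to Form D."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFD", name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical volume names: one in Form C, one in Form D, one plain ASCII.
volumes = ["Caf\u00e9\u00c5ngstrom", "Cafe\u0301A\u030angstrom", "Macintosh HD"]
clashes = lookalike_groups(volumes)
print(len(clashes))   # 1: the first two names collide after normalisation
```

Exact duplicates would be grouped too, since the function only compares the normalised result.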

The Finder shows the two volumes with identical names, but when you try to copy those names, what you copy is normalised to Form D, so it’s incorrect for the volume with the name left in Form C.

In some places, the two volumes are distinguished differently, the ‘duplicate’ being numbered 1, as if they had been normalised to Form D.


In Terminal, they’re both renamed, by appending the numbers 1 and 2, which is different from the handling of two volumes with the same normalised names.


Spotlight indexing only works for the volume whose name is in Form D. The volume with its name in Form C isn’t indexed at all, so its contents can’t be searched. I don’t think this is a method which Apple intended to be used to exclude volumes from indexing, though!

The underlying problem seems to be a bug in Disk Utility, which fails to normalise volume names to Form D as the Finder does. But there’s also a bug in Spotlight indexing which results in volumes with Form C names not being indexed at all. APFS brought us many great things, but this initial design decision has only brought problems, complexity and bugs like these.