APFS is currently unusable with most non-English languages

The time had come to test out my fears over problems with file and folder names in Apple’s new filing system, APFS. The TL;DR is that APFS is not currently safe to use with names which might have Unicode normalisation issues – which means it is only safe with a limited ASCII character set, as shown in the bizarre screenshot above.

You can’t partly normalise

For those who have not read my previous posts, here’s a quick recap of the nub of the problem. Knowledgeable regulars can skip ahead to the next section.

Unicode contains a vast number of characters, many of which have different code points, but are in fact the same character. A simple example is the letter e-acute: this can be represented by é (the single code point U+00E9), which in UTF-8 encoding is the two hex bytes c3 a9, or by é (the letter e, U+0065, followed by the combining acute accent, U+0301), which is the three hex bytes 65 cc 81. In some fonts there may be small differences, but in most cases we see identical characters and expect our computers to treat them the same.
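Those byte values are easy to check. Here is a minimal Python sketch using only the standard library; the variable names are mine, chosen purely for illustration.

```python
# Two ways of writing e-acute, spelled with escapes so the source is unambiguous.
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # letter e followed by COMBINING ACUTE ACCENT


def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


composed_hex = utf8_hex(composed)
decomposed_hex = utf8_hex(decomposed)

# The strings render alike, yet compare unequal and encode differently.
print(composed == decomposed)   # False
print(composed_hex)             # c3 a9
print(decomposed_hex)           # 65 cc 81
```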

The Unicode standard provides four normalisation forms to ensure that such equivalent sequences are treated alike. Currently, the Mac Extended file system, HFS+, uses Normalisation Form D (NFD). Under that, both representations of e-acute are automagically converted to the decomposed one, and stored as the three bytes 65 cc 81. In HFS+, this is done at the file system level, which means that everything that runs on a Mac, no matter whether an app, a shell command in Terminal, or macOS itself, works with normalised file and folder names. HFS+ prevents anything from creating non-normal names.
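To see what normalising to NFD does to a name, here is a short Python sketch. It applies the standard NFD from Python's unicodedata module, which may differ in detail from what HFS+ itself applies; the helper name to_nfd is hypothetical.

```python
import unicodedata

composed = "\u00e9"      # precomposed e-acute
decomposed = "e\u0301"   # e followed by combining acute accent


def to_nfd(name):
    """Return name in Unicode Normalisation Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", name)


# Both spellings collapse to the same decomposed sequence.
print(to_nfd(composed) == to_nfd(decomposed))   # True
print(to_nfd(composed).encode("utf-8"))         # b'e\xcc\x81'

# The four forms applied to one name; the lengths show which ones decompose.
forms = {
    form: unicodedata.normalize(form, "caf\u00e9")
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, value in forms.items():
    print(form, len(value))
```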

Apple’s new file system, APFS, already running on all iOS devices which use iOS 10.3 or later, does not perform such normalisation, but respects whatever Unicode characters are used, whether or not they have been normalised. Instead, normalisation is built into the higher-level system commands which work with files and folders. Apple’s advice to developers is therefore:
To avoid introducing bugs in your code with mismatched Unicode normalization in filenames:

  • Use high-level Foundation APIs such as NSFileManager and NSURL when interacting with the filesystem
  • Use the fileSystemRepresentation property of NSURL objects when creating and opening files with lower-level filesystem APIs such as POSIX open(2), or when storing filenames externally from the filesystem.

 

I and others have already expressed our concerns that some software does not follow that advice, and perhaps cannot, because some of the calls which it has to make are not (yet) supported in high-level APIs. Thus we feel that there is the potential for bugs to arise when running on APFS.

The risk of this happening rises the further the language in use departs from plain English, as normalisation affects such languages more, and regular English least. Switching from a file system which always normalises to one which leaves it to the operating system and app developers to handle would thus lead to mixed or partial normalisation, in which the same name can exist in both normalised and non-normalised forms.

Enter the command line and scripts

Because HFS+ normalises at the file system level, command shells (in Terminal, etc.) do not normalise. Not only that, but they cannot afford to if the file system doesn’t normalise, otherwise they would only be able to access files and folders with normalised names. It is all very well an app only seeing files with normalised names, but command tools must be able to see and work with all files and folders for which they have access permissions.

Having now used Terminal and current apps on an APFS volume, I am convinced that what currently happens is little short of catastrophic: it will unexpectedly break all sorts of tools, including use of command shells, scripts, possibly many property list files used to configure Launch Agents and Daemons, and more.

The reason is that commands run in Terminal or otherwise on an APFS volume do not undergo normalisation of file or directory names. By a strange quirk of fate, the normalisation form chosen by Apple for HFS+ (NFD) uses the Unicode characters which are least accessible from the keyboard. When I type in Terminal
touch café.txt
the name of the file created uses the non-normalised character for e-acute. However, higher-level calls, such as those used by the Finder and apps, work (as Apple promises) using normalisation. So the file of that name appears in some Finder windows, but disappears in the Icon view, as its name is non-normalised. If I then save a file from an app using that same name, its name undergoes normalisation, and instead of it creating a file with the name
café.txt
it creates a file of the name
café.txt

Although the latter may look identical, and does so when listed in Terminal, it uses different Unicode characters. So Terminal can now see two files in the same folder, with names which are apparently identical, but consisting of different Unicode characters (which I cannot distinguish).
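A directory listing can be checked for such look-alike pairs by grouping names on their normalised form. The following Python sketch does that for a list of names; the function name find_normalisation_clashes and the sample listing are hypothetical, and it assumes that comparing standard NFD forms is an adequate test of equivalence.

```python
import unicodedata
from collections import defaultdict


def find_normalisation_clashes(names):
    """Group names which become identical once normalised to NFD.

    Returns a list of groups, each holding two or more distinct names
    which share a single normalised form.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFD", name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Sample listing: the first two names look the same but differ in code points.
listing = ["caf\u00e9.txt", "cafe\u0301.txt", "notes.txt"]
clashes = find_normalisation_clashes(listing)

for group in clashes:
    for name in group:
        # Printing the raw bytes exposes the difference the eye cannot see.
        print(name, name.encode("utf-8"))
```

In practice the list of names could come from os.listdir() on the folder in question.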

The Finder knows that there are two files, but in Icon view shows only one. When I try to open either file using an app (which uses the higher-level normalising calls), the app will only open the one bearing the normalised name. If I try to open the other, its filename is normalised on the way, so the app always opens the file with the normalised name.

If I then trash the file with the normalised name, leaving just the one with non-normalised characters, the app will see it, but if I try to open it, the app reports an error code of -43, telling me that the file doesn’t exist. That is because it tried to open the file with the normalised name, which is no longer there.

If I instead trash the file with the non-normalised name, the app and Finder are happy. However, if I then type a Terminal command to act on the remaining file, for example to delete it with
rm café.txt
that passes the non-normalised name, not the normalised one, and returns the error
rm: café.txt: No such file or directory

However, as I do not know of any keyboard combination with which to enter the e-acute character in that filename, I cannot delete that file explicitly, but would have to resort to a pattern such as caf*.txt instead.
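A script can get round this more precisely than a wildcard by comparing names after normalisation, then using whichever spelling the directory actually holds. This is only a Python sketch under that assumption; resolve_typed_name and the sample names are hypothetical.

```python
import unicodedata


def resolve_typed_name(typed, entries):
    """Return the entries which match the typed name once both are in NFD.

    The entries are returned exactly as stored, so they can be passed on
    to low-level calls without any further normalisation.
    """
    key = unicodedata.normalize("NFD", typed)
    return [
        entry for entry in entries
        if unicodedata.normalize("NFD", entry) == key
    ]


typed = "caf\u00e9.txt"                     # what the keyboard produces
entries = ["cafe\u0301.txt", "notes.txt"]   # what the directory holds

print(typed in entries)                     # False: a literal match fails
matches = resolve_typed_name(typed, entries)
print([name.encode("utf-8") for name in matches])
```

With a real folder, entries would come from os.listdir(), and each match could be joined to the folder path with os.path.join() and handed to os.remove().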

The worst problem of all is that, without using my new free tool Apfelstrudel (see Downloads above), it is almost impossible to tell the difference between any of these names which appear identical to the eye.
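For those who prefer a script, the same kind of inspection can be roughed out in a few lines of Python. This is not how Apfelstrudel works internally, merely a sketch of the idea; the function names describe_name and normal_forms are hypothetical.

```python
import unicodedata


def describe_name(name):
    """List each code point in name with its number and Unicode name."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in name
    ]


def normal_forms(name):
    """Return which of NFC and NFD the name already conforms to."""
    return [
        form for form in ("NFC", "NFD")
        if unicodedata.normalize(form, name) == name
    ]


for name in ("caf\u00e9.txt", "cafe\u0301.txt"):
    print(name, normal_forms(name))
    for line in describe_name(name):
        print("   ", line)
```

A name made only of plain ASCII characters is reported as conforming to both forms, which is why such names are unaffected.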

Scripts multiply

Commands, passed through a shell or directly, are used very extensively across macOS. The filenames and paths used in those are currently, with HFS+, all normalised by HFS+ before they reach their targets. If you set the tool /usr/local/bin/café to run on demand, there is no ambiguity as to which command tool that refers to. In APFS, the potential for almost invisible error to occur is great, and is amplified by the fact that, in most cases, the Unicode characters which we type on our keyboards are not those to which characters are normalised.

There is no simple solution

One possible solution might be to apply normalisation of file and folder names in the command shell, although that would appear to me to be technically very hard to implement. It would also deprive command tools of access to files and paths which used non-normalised names, which would cause major issues the moment anything created a file or path including non-normalised Unicode characters.

It is feasible that shells could have some sort of escape option, perhaps putting names in double quotes "" if they were to be passed without normalisation. However, that would break a large number of existing scripts and calls, and be a major compatibility issue too.

In some ways, there are similarities with the issue of case-sensitivity in file and folder names, and Apple has already discovered how thorny that is. I am amused that high-level calls in iOS work to prevent case clashes, even though iOS uses (and used) a case-sensitive file system (in both HFS+ and APFS). The big difference from case-sensitivity is that we can all see upper and lower cases quite clearly. Without a tool such as Apfelstrudel, you cannot see any difference between normal and non-normal forms. Which turns a nasty problem into complete chaos.

Using APFS with accented and non-Roman characters is like being kissed when wearing a blindfold: most of the time it is pleasant, but every now and again you really regret it.