hoakley May 8, 2021 Language, Macs, Technology

Explainer: Unicode, normalization and APFS

One of the oldest problems with Apple’s APFS file system is how it encodes file and directory names using Unicode. This collides with one of the thorniest problems with Unicode, the fact that characters which appear identical can have two or more different code points (encodings). To see this in action, you’ll need a copy of my free app Apfelstrudel.

Open Apfelstrudel, place the cursor in its Input box and type on your keyboard the word Café, with an acute accent on the final e (press Option-e then the e key again to generate that). Then press Return.


In the next box down from Input, labelled HFS+, you’ll see that highlighted in red, and the word Café repeated. This is because, although those two renderings of the word appear identical, they use different Unicode code points. The version you typed in is in ‘normalised Form C’, while that used in the old HFS+ file system would be in ‘normalised Form D’.
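The difference between the two forms can also be seen outside Apfelstrudel. The following is a minimal Python sketch, not how Apfelstrudel itself works, which uses the standard unicodedata module to list the code points in each form. It assumes standard Form D (NFD); HFS+ used its own close variant of that, which behaves the same for this word.

```python
import unicodedata

typed = "Caf\u00e9"                               # Form C: é is the single code point U+00E9
decomposed = unicodedata.normalize("NFD", typed)  # Form D: e followed by combining acute U+0301

for label, text in (("Form C", typed), ("Form D", decomposed)):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {len(text)} code points: {code_points}")

# Form C: 4 code points: U+0043 U+0061 U+0066 U+00E9
# Form D: 5 code points: U+0043 U+0061 U+0066 U+0065 U+0301
```

Both strings render as Café, but they differ in content and in length.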

Now try this simple test. Create a new folder and paste the Input version of Café as its name. Then create another new folder alongside it (in the same folder) and try pasting the HFS+ version as its name. The Finder won’t let you, as it considers that those two names are identical, just as they appear to be, even though they contain different Unicode code points, and even their lengths differ.


This happens in HFS+, which is a ‘normalising file system’, to avoid you becoming confused by two items which appear to have identical names. If you try to name an item using Form C, the file system automatically converts that name to Form D. So although you may have provided two different names, HFS+ normalises them both to Form D, in which they really are identical.
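To make that behaviour concrete, here is a toy model in Python of a folder which normalises every name to Form D before storing it. The class NormalisingFolder and its create method are invented for this illustration, and it uses standard NFD rather than the exact variant in HFS+.

```python
import unicodedata


class NormalisingFolder:
    """Toy folder that stores every name in Form D (hypothetical, for illustration only)."""

    def __init__(self):
        self.names = set()

    def create(self, name):
        stored = unicodedata.normalize("NFD", name)  # normalise before storing
        if stored in self.names:
            raise FileExistsError(f"{name!r} already exists")
        self.names.add(stored)
        return stored


folder = NormalisingFolder()
folder.create("Caf\u00e9")          # Form C, as typed
try:
    folder.create("Cafe\u0301")     # Form D, looks identical
    clash = False
except FileExistsError:
    clash = True

print(clash)              # True: the second name is refused
print(len(folder.names))  # 1
```

Because both names are reduced to the same stored form, the second one is treated as a duplicate, much as the Finder refuses it in the test above.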

Normalising file systems used to be quite popular, but by the time Apple came to design APFS, such behaviour was falling out of favour. In its original specification, APFS was billed as being non-normalising (the original phrase referred to filenames as just being a ‘bag of bytes’). That attracted a lot of criticism from macOS developers, who could see the problems it would bring to users and apps alike. In spite of that, Apple’s engineers stuck to their guns.

When APFS was first released, chaos ensued, not so much with languages based on Roman alphabets, but worst of all in Korean, where there are a lot more normalisations. Among the biggest casualties were Apple’s own apps, so its engineers worked on a fix. Being confident still that APFS had made the right choice, they didn’t make APFS a normalising file system, but applied normalisation wherever it’s needed in macOS, and that’s what you see going on in my little demonstration above, and why you can’t have items using both normalised forms in the same folder.

For the great majority of users, this works fine. APFS doesn’t normalise, which makes it a bit more efficient, but a normalisation layer in macOS ensures that file and directory names are normalised, so APFS behaves just like HFS+ did. Except that it doesn’t always.

There are two potential problems which can appear, apparently out of the blue.

The first is with apps which generate their own filenames (and foldernames) from un-normalised Unicode text. Let’s say I have an app which creates and maintains its own image library, based on metadata stored with each image. If I as a user save a metadata field for an image using the Form C version of Café, which is the more likely as that’s what’s generated from the keyboard, and the app tries to use that as the filename, macOS should normalise that to Form D. If that app is unaware of the normalisation, metadata and filename end up being different, which can cause misunderstanding. Developers need to be aware of this and track the file path using the correct form, or even better using more independent mechanisms such as bookmarks.
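One defensive approach can be sketched in Python: rather than assuming that the name on disk matches the metadata string, the app looks for a canonically equivalent name in the folder listing and uses whatever it finds there. The function find_on_disk and the sample listing are made up for this illustration, and the sketch assumes that names read back may be in Form D.

```python
import unicodedata


def find_on_disk(wanted, listing):
    """Return the name in listing that is canonically equivalent to wanted, or None."""
    key = unicodedata.normalize("NFD", wanted)
    for name in listing:
        if unicodedata.normalize("NFD", name) == key:
            return name  # hand back the name exactly as it is stored
    return None


metadata_title = "Caf\u00e9.jpg"              # Form C, as typed by the user
listing = ["Cafe\u0301.jpg", "Bistro.jpg"]    # assumed folder contents, first name in Form D

print(metadata_title in listing)              # False: a literal match fails
match = find_on_disk(metadata_title, listing)
print(match == "Cafe\u0301.jpg")              # True: the stored Form D name is found
```

On a Mac, bookmarks remain the more robust way of tracking a file, as they don’t depend on the name at all.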

More likely and more serious are the conflicts which can occur when using different methods of accessing non-Mac file systems. Thomas Tempelmann has demonstrated this using a share on a NAS, which he mounted first using SMB, then created a file with a name in Form C. When mounted via NFS on a Mac, as that filename isn’t in the Form D which macOS asks for, it can’t be accessed, as he has described here. Apple’s recommended solution is to mount NFS shares with the nfc option enabled, which should ensure that names are converted to Form C when they’re sent to the server. As ever, Michael Tsai has a succinct summary here.
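This kind of failure can be modelled as a byte-exact lookup. The Python sketch below assumes a server which stores names as raw UTF-8 bytes and never normalises them. The set server_names and the function lookup are invented for the illustration, and don’t represent any real NFS or SMB API.

```python
import unicodedata

# Assumed model: the server keeps names as raw bytes and compares them exactly.
server_names = {"Caf\u00e9.txt".encode("utf-8")}   # file created with a Form C name


def lookup(name, form=None):
    """Ask the toy server for name, optionally converting it to the given form first."""
    if form is not None:
        name = unicodedata.normalize(form, name)
    return name.encode("utf-8") in server_names


requested = unicodedata.normalize("NFD", "Caf\u00e9.txt")  # a client asking in Form D

print(lookup(requested))          # False: the bytes differ, so the file isn't found
print(lookup(requested, "NFC"))   # True: converting the request to Form C matches
```

The file is there all along; it’s only the form of the name in the request which decides whether it can be reached.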

There are all sorts of other ways that Unicode normalisation can trip apps up. Apfelstrudel shows some of them in its lower text view: Form C and D strings should compare correctly using Swift == and NSString compare() when caseInsensitive, but not with NSString isEqual() comparison.
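The same trap exists in other languages. As an illustration in Python rather than Swift or Objective-C, the == operator on strings compares code points literally, so the two forms aren’t equal unless both are normalised first. The helper names canonically_equal and caseless_equal are mine, and the caseless version follows the approach of normalising, case-folding, then normalising again.

```python
import unicodedata


def canonically_equal(a, b):
    """True if a and b are the same once both are in Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a, b):
    """True if a and b match ignoring case and normalisation form."""
    def key(s):
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


form_c = "Caf\u00e9"
form_d = "Cafe\u0301"

print(form_c == form_d)                         # False: literal comparison
print(canonically_equal(form_c, form_d))        # True
print(caseless_equal("CAFE\u0301", "caf\u00e9"))  # True
```

Which comparison an app needs depends on what it’s doing, but it should at least be a deliberate choice.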

All these could and would have been so much simpler if there was only one form of normalisation, or if visually identical characters had but a single Unicode code point. Until then, be aware that every now and then normalisation problems can appear out of the blue and cause strange errors. To make this a little easier to handle, I will shortly be building the features in Apfelstrudel into Mints, so they’re more accessible and easier to understand.

13 Comments

1

VRic on May 8, 2021 at 7:28 pm

Thank you for this great summary of the filenames mess we’re in and the tools you give us to navigate it.

One of the pitfalls that you hint at but I haven’t seen discussed is that, as far as I know, you can’t easily count filename lengths in AppleScript, which would seem a rather basic need.

(I haven’t checked recently, having been off the AppleScript-users list since my “temporary” hiatus from the Emailer-list)

The issue here is: Because HFS+ filenames are limited to 255 (formerly 31) normalized (decomposed) UTF-16 code units, it’s necessary to bound a string to that length before naming something, but a string’s length counts visible (composed) characters, which is irrelevant for that purpose (and always shorter whenever discrepancies arise).

We could ask the surprisingly fine open source text editor CotEditor for help, but 3rd party apps or CLI tools aren’t a reasonable requirement to distribute otherwise basic scripts.[1]

Of course APFS muddies the waters even more and I refuse to chase a hypothetical workaround down its probably bottomless rabbit hole. As of now I just throw my hands up (and also a dialog asking for another filename to try, which is utterly inappropriate to run unattended).

[1] CotEditor includes a (menu and AppleScript) command to switch between normalized forms of Unicode text, similar to your Terminal command unorml.
https://coteditor.com/
https://github.com/coteditor/

On a sidenote, it’s not often we see a scriptable open-source project. In fact, if anyone can think of another one I’d like to know about it.

- 2
  
  hoakley on May 8, 2021 at 9:59 pm
  
  If your product is freeware, you’re very welcome to ship unorml with it in its standard installer package, if that would help. Sadly, Apple has neglected to maintain AppleScript to keep in touch with what’s needed. There are accessible ObjC/Swift system calls which will normalise strings which you could also use.
  Howard.
  
  - 3
    
    VRic on May 11, 2021 at 5:46 am
    
    Thank you for the suggestions.
    
    I’m aware of AppleScript’s access to system calls, but I confess it stayed outside my comfort zone due to laziness. I vaguely remember using that maybe to measure execution times, but I lost enthusiasm around the time Apple lost interest in AppleScript Studio.
    
    Also, having settled on routines written years ago, probably decades by now, I had no impetus to hack at it again for admittedly minor benefits (it’s more about knowing something isn’t *correct* than it not working right basically all of the time —I have fond memories of a time when I wrote correct code: it was beautiful Lisp; it was also never finished—).
    
    Although I sure have rewritten many scripts since the first lockdown!
    
    If I get around to fixing that nagging issue, I’d rather include unorml and let you do all the hard work… So, again thank you.
    
4

jonshier on May 9, 2021 at 1:12 am

NSString’s isEqual is well known not to properly check string equality because it isn’t supposed to. It’s simply the inherited implementation from NSObject that compares object equality. For proper string comparison you’d start with isEqualToString: and go from there. Fortunately you should never see it in Swift unless you’re using NSString directly.

- 5
  
  hoakley on May 9, 2021 at 5:56 am
  
  Thank you. To be fair to isEqual(), this is clearly documented. However, there’s a multitude of string comparisons, some of which are based on normalised strings, and some not. If a developer isn’t particularly aware of the problems of normalisation and how it’s handled in macOS, these can be quite a pitfall, and there are some apps which have fallen into that pit.
  Howard.
  
6

Raphael on July 2, 2021 at 5:56 am

> That attracted a lot of criticism from macOS developers

Do you have any references for this? I had only ever heard about criticisms from developers that forced normalization was the problem (not not having it), including from Linus Torvalds (admittedly not a macOS developer): http://adam.curry.com/art/1421291458_hzcb9PRw.html

I get why Apple built forced normalization into their file creation APIs but I don’t understand why their file lookup/listing/open/save APIs would do it too.
If you pass a file path to an API, chances well above 99% are no-one typed it in manually (it either comes from a file a user selected in an open dialog, copy-pasted from the finder, or from a directory listing; in each of these cases, the normalization form you got is the one required to access the file).

- 7
  
  hoakley on July 2, 2021 at 7:40 am
  
  Thank you. A little search here would show that I first started writing about this in April 2017, and include links to discussions elsewhere. I even developed an app, Apfelstrudel, to assess the problem. Three early articles you might like to read are:
  File problems in iOS 10.3 and macOS 10.13: What’s in a name?
  Untangling file names and normalisation with Apfelstrudel
  APFS is currently unusable with most non-English languages
  and there have been many more since then.
  Howard.
  
  - 8
    
    Raphael on July 2, 2021 at 2:50 pm
    
    So, in essence, there’s “a lot of criticism”, not from macOS developers, but from one macOS developer in particular, namely yourself.
    
    I, for one, would much prefer it if file systems didn’t normalize anything, not Unicode, not case.
    
    Yes, it can be frustrating for some users if they happen to type different variants of a file name.
    
    But contrast that to how frustrating it is for users unable to open a certain file because, for some reason, it’s missing the normalization the system expects (as happens frequently on macOS with NFS shares; in fact, it has happened to me. None of the other potential problems you describe have ever happened to me or anyone I know. And yes, where I live we use lots of accents. And we do have keyboard layouts that allow entering both the combining and non-combined variants of many accented letters). Or the program that it should open with isn’t equipped to handle it because it uses APIs at the wrong level of abstraction.
    
    Never mind that there are already *many* ways file names can be distinct but not visually distinguishable. Like using a non-U+0020 space (of which there are many in Unicode: thin space, non-breaking space, …). Or using a Greek or Cyrillic letter that looks like a Latin one. Or weird merged directories, like the virtual “Documents” folder in Windows that shows files from different directories with no restrictions on uniqueness.
    
    The only reason file names need to be unique at all is because we haven’t found a better way to refer to them (actually, Mac OS Classic did have better ways, but let’s not get into that) and sometimes they have to be typed in. The former is just a leaky abstraction while the latter is increasingly uncommon. GUI app users probably almost never type in the name of an *existing* file (meaning except when saving for the first time). Terminal users sometimes do but often also use tab completion.
    
    Come to think of it, files shouldn’t even need to have names at all. Real files also don’t have names, just content.
    
    In my book, the comparison is as follows:
    
    Reasons for doing some normalization:
    
    • Make it easier for users to type in names to existing files
    • Make file listings less confusing for some cases
    
    Reasons for not doing any normalization:
    
    • Train users that two strings that look the same may actually be distinct
    • Full interoperability with existing systems (e.g., NFS)
    • Whatever you type in is what you get. Always. No Exceptions.
    • All files present in a directory can be opened, even though sometimes you have to copy-paste the file name to do it (in case of Terminal apps that don’t support tab completion)
    • File listings can still appear to have duplicate files, regardless of normalization
    
    Looking at this comparison it’s clear which drawbacks I’d rather live with.
    
    - 9
      
      hoakley on July 2, 2021 at 3:10 pm
      
      No. That’s utter bollocks, and anyone involved with the early months of APFS is fully aware of all the controversy and problems over Apple’s design decision. I’m sorry, but if you don’t know how to research a topic like this, you’d better not be so judgemental of others.
      In just a few minutes, I offer the following developers who were in heated debate at that time and a couple since:
      David Reed
      Pierre Lebeaupin
      Michael Tsai
      Hacker News comments
      Michael Tsai and Thomas Tempelmann
      Now please go and read those articles and their comments, in which I had relatively little input, and then decide whether this was all just me. And explain why Apple had to introduce normalisation layers after it had released the first version of APFS on iOS. (Clue: the answer is that APFS’s lack of normalisation broke Apple’s own apps).
      You’re welcome to your opinion on the merits and demerits of normalisation, and I fully accept it’s a controversial issue. But please don’t accuse me of misrepresenting what happened, because you’re wrong.
      Howard.
      
    - 10
      
      Raphael on July 2, 2021 at 3:41 pm
      
      Sorry, I retract my ad-hominem attack. It was uncalled for… And it gave you an excuse to focus on that while sidelining the entirety of my argument.
      
      I can’t speak for anyone you named but it seems to me your crown witnesses aren’t as clearly on your side as you think they are, at least in the posts you linked to:
      
      David Reed seems to only have been seeking confirmation of his assumptions.
      Same for Pierre Lebeaupin, though these questions seem to call out potential issues. Then again he also admits that “this should result in less issues for software”
      Michael Tsai seems to be mainly criticizing the road Apple chose to deal with this transition, not the transition itself.
      
      I agree with that criticism wholeheartedly: Apple botched the transition completely.
      
      They had the chance to get a clean slate that didn’t replicate past mistakes (i.e., normalization) but then decided to request that apps do their own normalization for save operations, which rightly got a lot of devs angry because it meant more work for them.
      
      If they had instead just made it a recommendation (“for user convenience, please normalize typed-in filenames to the following form”), no-one would have complained. Caving to pressure from those devs, Apple decided to build normalization back into their high-level API (and botched it in such a way that broke NFS in the process). It seems to me Thomas Tempelmann was complaining about this (the fact that Apple’s API now uses normalization in their high-level API – not that the file system doesn’t normalize anymore).
      
      I think Apple’s original plan would have worked and caused far fewer headaches than whatever we had before with HFS+ and what we have now with low-level APIs behaving differently from high-level APIs.
      
    - 11
      
      hoakley on July 2, 2021 at 4:49 pm
      
      Thank you.
      I’m not on anyone’s side, and have carefully re-read the article above, and consider it an accurate and faithful account of the early days with APFS and the problems that have arisen.
      Apple didn’t cave in “to pressure from those devs”. It did absolutely nothing, apart from telling us that was the way that APFS worked, so get used to it. It was only when Apple was faced with its own apps breaking wholesale in Korean – not any of the accented European languages, where normalisation is relatively minor – that it suddenly introduced a normalisation layer, to address its own bugs.
      Furthermore, if you’ve seen my recent article about the bug in Disk Utility which allows users to name APFS volumes in either Form C or Form D, and then breaks Spotlight, you will realise that the problems still haven’t gone away, over four years later.
      Unicode was a big mistake, and since then every decision made about the problems it generated has also been erroneous. What we now face is a horrible mess, which could so easily have been prevented.
      The reality of file systems for ordinary users is the overriding consideration here. The fact that macOS can’t use the case-sensitive variant of APFS speaks volumes as to the problems resulting from not even being able to tell the characters apart.
      I stand by every word I wrote above.
      Howard.
      
12

Thomas Bohn on July 25, 2021 at 6:58 am

For me as a user normalisation D is annoying. It seems to be invented by someone who never uses umlauts or things like that in their daily life. The behaviour of some applications is completely off and they have to be fixed so that they can “un-normalise” the file names again. I remember the days when the bash version which came with macOS did not support it completely.

Just this weekend, I uploaded some folders to OneDrive via Chrome and when I wanted to sync them with another application this failed because of the normalisation.

- 13
  
  hoakley on July 25, 2021 at 2:09 pm
  
  Thank you. Yes, it’s a horrible mess. Imagine what it must be like in Korean!
  Howard.
  
