Untangling file names and normalisation with Apfelstrudel

We take for granted that our current Mac Extended file system, HFS+, protects us from many problems which can occur with file and folder names. Yesterday, I highlighted the potential for problems when working with Apple’s new file system, APFS, whether that is on an iOS device running iOS 10.3 or later, or later this year when Macs will start switching to use APFS.

Today, I have some practical demonstrations, a tool which helps you explore what normalisation is all about, and examples to illustrate my point.

Starting with the simple, create a new folder and give it the name (copied from here) café . Then, alongside it, create another new folder, and give it the name (copied from here) café . The Finder will stop you from creating that second folder, as it sees the names of the two as being the same.

In fact, they’re not. Copy and paste those same words from this article into a good text editor which can tell you how many characters there are in each. You’ll discover that the first word has four characters, as you expected, but the second has five. This is because the e-acute in the second word is created using two Unicode characters. Yet the words look identical.
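If you don't have a suitable text editor to hand, the same check can be sketched in a few lines of Python. This is only an illustration: the escapes below are my assumption about what the two words contain (a precomposed U+00E9 in the first, a plain e followed by U+0301 in the second).

```python
import unicodedata

# Assumed contents of the two words: precomposed vs decomposed e-acute.
composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # plain e + U+0301 COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))   # 4 5
print(composed == decomposed)           # False: the raw strings differ

# Once both are normalised to Form D, they compare equal.
same_after_nfd = (unicodedata.normalize("NFD", composed)
                  == unicodedata.normalize("NFD", decomposed))
print(same_after_nfd)                   # True
```

Python's len() counts Unicode code points, which is what a text editor reporting "characters" is usually doing here.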

Repeat the same test using the two Korean words 훈민정음; 訓民正音 and 훈민정음; 訓民正音 Again, they look identical, but in your text editor the first should consist of 10 characters, and the second no less than 18 – a very large difference. The Finder will not let you name two folders alongside one another with those two different sets of characters, though, as it normalises the first to the second, and they are thus identical.
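The jump from 10 to 18 can be reproduced with a short Python sketch. I am assuming here that the first version is fully precomposed and the second is its Form D decomposition; the code points are written as escapes so that nothing can re-normalise them on the way.

```python
import unicodedata

# Assumed precomposed form: four Hangul syllables, "; ", then four Hanja.
hangul = "\ud6c8\ubbfc\uc815\uc74c"
hanja = "\u8a13\u6c11\u6b63\u97f3"
composed = hangul + "; " + hanja

decomposed = unicodedata.normalize("NFD", composed)

print(len(composed))     # 10
print(len(decomposed))   # 18

# Each of these Hangul syllables splits into three conjoining jamo;
# the punctuation and the Hanja are left as they are.
for syllable in hangul:
    parts = unicodedata.normalize("NFD", syllable)
    print(syllable, [f"U+{ord(c):04X}" for c in parts])
```

All of the extra eight characters come from the Hangul: four syllables become twelve jamo.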

Now repeat that test using the two words Aﬃnity… and Affinity... These don’t look quite identical, because the first uses a ligature character and a horizontal ellipsis, while the second uses separate ffi characters and three periods/stops. Although they read the same, their character counts differ (7 against 11), and you can create two folders in the same location with those names, because HFS+ sees them as different strings. Yet if you use a different form of Unicode normalisation, the first is normalised to the second!

With those two Affinity… folders, use Spotlight or Finder Find to look for the name of one of them. It will only find the one, and not the other. If HFS+ used NFKC or NFKD to perform normalisation, then Spotlight would find both, whichever of the versions you entered into it.
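To see which normalisation forms treat the two names as the same, here is a small Python sketch. It uses Python's standard unicodedata module rather than anything in macOS, and the two strings are my reconstruction of the names described above.

```python
import unicodedata

ligature = "A\ufb03nity\u2026"   # U+FB03 ffi ligature, U+2026 horizontal ellipsis
plain = "Affinity..."            # separate letters and three full stops

# For each form, record whether the two names become identical.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = (unicodedata.normalize(form, ligature)
                     == unicodedata.normalize(form, plain))
    print(form, results[form])
```

The canonical forms (NFC and NFD) leave the names distinct, while the compatibility forms (NFKC and NFKD) map the ligature and the ellipsis to their plain equivalents, so the two names match.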

These issues are very hard to spot because they are almost completely invisible. Another odd example is the pair Åland and Åland which use the Nordic letter Å. You wouldn’t think that Unicode had too many different versions of Å around, but there are two: one is designated LATIN CAPITAL LETTER A WITH RING ABOVE, the other isn’t given a name but is U+0041 U+030A. There are two, because one represents the symbol for the Ångström unit, which is actually defined as being the Nordic capital Å.

Under the Form D normalisation used by HFS+, Å is normalised to the three UTF-8 bytes 41 cc 8a. Under the Form C normalisation commonly used on Linux, it is normalised to the two bytes c3 85. APFS preserves whichever form it is given but does not normalise names itself, so a file copied from Linux and one copied from HFS+ can have the same visible name and yet co-exist in the same folder.
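Those byte sequences are easy to confirm. A minimal Python sketch, in which utf8_hex is just a helper name I have made up for this example:

```python
import unicodedata

def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

a_ring = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

form_d = unicodedata.normalize("NFD", a_ring)   # A + U+030A COMBINING RING ABOVE
form_c = unicodedata.normalize("NFC", form_d)   # recomposed to U+00C5

print(utf8_hex(form_d))   # 41 cc 8a
print(utf8_hex(form_c))   # c3 85
```

Two names which differ only in this way are different byte strings, which is all that a file system which does not normalise has to go on.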

How will apps cope with such issues? How, indeed, will the Finder? Does the Finder show both those files if they are in the same folder, or only show one of them, and if so, which?

At the moment, it is very hard to look at these issues, because the only way of discovering exactly which Unicode characters are being used, and what happens to them under normalisation, is to look them up in the Emoji & Symbols panel, and then look in the Unicode reference to see how each character might normalise.

To help see what is going on, I have put together Apfelstrudel – a Sierra-only app which applies all four official normalisation forms to any text which you care to enter into it. It will also warn you if HFS+ will normalise it to something different, tell you exactly what it will normalise to, and give you a full account which you can save to a text file. The latest release is available from Downloads above.
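For those who would rather experiment in a script, the general idea can be sketched in Python. This is not Apfelstrudel's code, and normalisation_report and code_points are names invented for this example. It also assumes that standard Form D is a fair stand-in for what HFS+ does, when HFS+ actually follows its own variant of the decomposition rules.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalisation_report(name: str) -> dict:
    """Return each normalised form of name, plus whether NFD changes it."""
    report = {form: unicodedata.normalize(form, name) for form in FORMS}
    # Assumption: plain NFD approximates the name HFS+ would store.
    report["changed_by_nfd"] = report["NFD"] != name
    return report

def code_points(text: str) -> str:
    """List the code points in text, e.g. 'U+0041 U+030A'."""
    return " ".join(f"U+{ord(char):04X}" for char in text)

report = normalisation_report("\u00c5land")
for form in FORMS:
    print(form, code_points(report[form]))
print("changed by NFD:", report["changed_by_nfd"])
```

A plain ASCII name such as help.txt comes back unchanged in all four forms, so the flag is only raised for names which are at risk.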

Another way of exploring potential problems is to use iCloud Drive between an iOS device running APFS (iOS 10.3 or later) and a Mac running macOS 10.12.4. I have not yet explored this with Unicode normalisation, only with case sensitivity, and that is confusing enough. If you use an Apple app such as Pages to try to create two files in the same location with names that differ only in their case – such as help.txt and Help.txt – then Pages behaves like case-insensitive HFS+ and refuses to accept both.

Other apps do not behave the same way, though. Using an iOS text editor, I was able to create those two files in the same folder on my iCloud Drive. But when I looked at them from my Mac, one was called help.txt and the other help 3.txt: iCloud Drive was presenting macOS with different file names from those it presented to iOS. Although that kludges around the problem, it will cause other issues.
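If you want to check a list of names for this kind of clash before copying them to a case-insensitive volume, something like the following Python sketch would do as a first pass. The function name case_collisions is hypothetical, and Python's casefold() is only an approximation of the comparison rules that HFS+ applies.

```python
from collections import defaultdict

def case_collisions(names):
    """Group names which differ only in case; return groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: casefold() approximates a case-insensitive file system.
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(case_collisions(["help.txt", "Help.txt", "notes.txt"]))
# [['help.txt', 'Help.txt']]
```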

Now consider a macOS app which runs very nicely under Sierra on HFS+. Like Photos, it manages a database of images which also exist as individual image files. In the database are the path and file names for each of the images. The developers know that HFS+ will normalise the path and file names, so before those names are saved into the database, they diligently normalise each using macOS function calls, but use a low-level file save routine when the files are written out to storage. As that will still result in normalisation under HFS+, the name in the database is in perfect sync with that in HFS+.

Now move that app and its database to APFS, regardless of its case-sensitivity, and imagine you are just importing your photos from your recent holiday on Åland. The app still normalises the file names which it puts into its database, but the low-level file routines it uses pass un-normalised names to APFS, which preserves their lack of normalisation.

When you try to open some of those images, those containing the character Å for example, the file name in the database does not match any file name in APFS. Those images are now missing.
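The mismatch can be simulated without a real file system. In this Python sketch a set stands in for a volume which stores names exactly as given, the file name is made up, and find_normalised is a hypothetical helper showing one way an app might defend itself.

```python
import unicodedata

typed_name = "\u00c5land-001.jpg"   # hypothetical name, precomposed Å

# The app normalises the name before storing it in its database...
database_name = unicodedata.normalize("NFD", typed_name)

# ...but the file system (simulated by a set) keeps the name it was given.
stored_names = {typed_name}

print(database_name in stored_names)   # False: the exact lookup fails

def find_normalised(wanted, names):
    """Return the stored name which matches wanted once both are in Form D."""
    target = unicodedata.normalize("NFD", wanted)
    for name in names:
        if unicodedata.normalize("NFD", name) == target:
            return name
    return None

print(find_normalised(database_name, stored_names) == typed_name)   # True
```

Comparing normalised forms of both names recovers the file, at the cost of scanning the folder rather than asking for the file by name.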

Even more catastrophic would be the case of someone who managed to create a folder which had an un-normalised name, but who then tried to access it using software (like the Finder, perhaps?) which normalised the name. All apps which could use only normalised names would then lose access to that folder, as every call that they made would use the normalised version of the folder name, which does not exist in APFS.

I hope that this has convinced you that, for languages in which normalisation changes the Unicode representation of file and folder names, APFS could bring some very odd problems. I hope that you will find Apfelstrudel a good way of exploring these weird and fascinating problems, and (if you’re a developer) I hope that it helps you track potential problems down before we start using your apps on APFS.

Tomorrow I will step through the code used by Apfelstrudel, explaining how to normalise strings, and more.

2 Comments

1

al45tair on May 17, 2017 at 12:30 pm

You mention Å and comment that there are two possible variants of it, but actually there are three, because there is another character in the Letterlike Symbols block, U+212B ANGSTROM SIGN, which generally has an identical appearance. I think I’m right in saying that if you normalise that it will still turn into U+0041 U+030a. Similar issues exist for U+212A KELVIN SIGN (looks like “K”) and U+2126 OHM SIGN (looks like U+03A9 GREEK CAPITAL LETTER OMEGA).

There are also a number of other “fun” combinations that normalisation won’t fix, for instance U+0410 CYRILLIC CAPITAL LETTER A tends to bear a remarkable resemblance to U+0041 LATIN CAPITAL LETTER A, and these allow similar things to happen even in the current HFS+ — though generally you need to be deliberately malicious to get it to happen there.

- 2
  
  hoakley on May 17, 2017 at 3:27 pm
  
  Thank you for the correction, which makes life even more confusing! It had not occurred to me that there would also be separate Ohm and Kelvin signs – that seems more than excessive. I don’t suppose there’s a separate Celsius sign, Ampère sign… ?
  Howard.
  