High Sierra and filenames: Apple is relenting

macOS High Sierra brings with it a new file system, APFS. Designed to deliver much better performance than the ageing Mac OS Extended file system (HFS+), which remains standard in Sierra and earlier, it brings other major advantages such as snapshots and built-in support for encryption. It also looked as if it might come with some major issues, particularly for those who name folders and files using characters outside the old ASCII character set, such as accented characters and non-Roman scripts.

Background

At present, HFS+ blocks us from having two files in the same folder whose names differ only in capitalisation (HFS+ is case-insensitive), or only in the encoding of Unicode characters such as those with accents (HFS+ is normalising). This helps ensure that files and folders with confusable names are not allowed side by side.
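As a rough illustration of those two rules, and not the algorithm which HFS+ actually uses, the Python sketch below builds a comparison key from a name by decomposing it and folding its case. Two names with the same key are the ones that a case-insensitive, normalising file system would refuse to keep side by side. The function name collision_key is my own invention, and HFS+ applies its own decomposition and case-folding tables rather than Python's.

```python
import unicodedata


def collision_key(name):
    """Hypothetical key: names with equal keys are treated as the same name.

    Assumption: NFD plus Python's casefold() stands in for the file
    system's own normalisation and case-folding rules.
    """
    return unicodedata.normalize("NFD", name).casefold()


# Differ only in capitalisation.
case_clash = collision_key("ThisFile.text") == collision_key("thisfile.text")

# Differ only in how the accented character is encoded.
accent_clash = collision_key("caf\u00e9.text") == collision_key("cafe\u0301.text")

# Genuinely different names do not clash.
no_clash = collision_key("one.text") == collision_key("two.text")
```

Here case_clash and accent_clash both come out true, while no_clash is false.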

Although there is a case-sensitive variant of HFS+ which allows the co-existence of ThisFile.text and thisfile.text in the same folder, and that has been used for iOS devices, those who have tried using it on their Macs have found that many apps break as a result, as they assume that the file system is case-insensitive.

Ignoring case-sensitivity for the moment, handling characters with accents is complex because Unicode allows visually identical characters to be encoded in different ways: é, for example, can be a single code point, or a plain e followed by a combining acute accent. To overcome such problems, HFS+ automatically converts those forms to use common encodings, a process known as normalisation (normalization, if you prefer).
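To make that concrete, here is a minimal sketch in Python, using the unicodedata module from its standard library. It takes two spellings of the same word which look identical on screen, shows that they are different strings, and that they become equal once both are converted to the same normalisation form. The helper same_after_normalising is a name I have made up for illustration.

```python
import unicodedata

# Two spellings of the same word which render identically.
composed = "caf\u00e9"      # é as a single code point, U+00E9
decomposed = "cafe\u0301"   # e followed by a combining acute accent, U+0301


def same_after_normalising(a, b, form="NFC"):
    """Compare two strings after converting both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


raw_equal = composed == decomposed
normalised_equal = same_after_normalising(composed, decomposed)

# The two spellings also occupy different numbers of bytes in UTF-8.
utf8_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

The raw comparison fails, the normalised one succeeds, and the UTF-8 lengths are 5 and 6 bytes respectively.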

Normalisation incurs overhead. Every time that HFS+ handles a file or folder name, it has to check each of the characters to see whether it needs to be converted to its normalised form. That check cannot be reduced to a simple rule or calculation: whether a character needs normalising, and what it should become, has to be determined character by character.

By eliminating the step of normalisation altogether, a file system can achieve substantial improvements in performance, and that is what APFS aims to do. It is by no means unique in this: storing names without normalising them is becoming a feature of many modern file systems.

The first variant of APFS to be released, in iOS 10.3, is both case-sensitive and normalisation-sensitive. Although that variant is not intended for routine use with macOS, I have described the very obvious problems which it can cause there. However, in the much more restricted environment of iOS, it was thought that it should work fine.

Knowing the problems that result from the case-sensitive variant of HFS+, Apple wisely developed a case-insensitive variant of APFS which is intended for macOS, and first became available in preview form with macOS Sierra 10.12.4. I looked at that too, and found at that time that it resulted in fewer problems, but was still not a seamless switch from HFS+.

Apple now has a lot of experience with APFS in its iOS variant, and growing experience with its macOS variant. As a result, changes are being made for iOS 11 and possibly for High Sierra too.

iOS and case-sensitive APFS

Although iOS has a much more constrained range of apps, and limits access by the user in comparison with macOS, problems have been reported with APFS in iOS, apparently because of its lack of normalisation. As the file system in iOS (and watchOS, and tvOS) now stores filenames without any normalisation, there is the potential for the problems which I encountered when using this variant with macOS.

Apple is therefore adding normalisation to the case-sensitive variant of APFS. This is being offered at two levels, which Apple terms native and runtime. Runtime normalisation is not as efficient but works around existing problems, and will be available in the next release of iOS, 10.3.3, and later.

Native normalisation (essentially accomplishing what HFS+ has been doing) requires a file system update, and will come with iOS 11. However, at least initially, it will only apply to those users who perform an erase-restore when upgrading to iOS 11. Apple intends to convert all devices – iOS, watchOS, and tvOS – to native normalisation, probably later in 2017 or early next year.

Native normalisation is also supported in the case-sensitive variant of APFS for macOS, so it will be part of High Sierra.

macOS and case-insensitive APFS

Apple still intends that installation of High Sierra will, by default, convert the startup volume to the case-insensitive variant of APFS, although it is not yet clear whether there will be options to stay with HFS+ or to opt for the case-sensitive variant instead. Although the latter may seem attractive now that it normalises, it may well break a great many apps, so should not be seen as a solution.

As yet, Apple has not announced any native normalisation in the case-insensitive variant of APFS which will be the default for High Sierra. As things stand, this means that macOS users may experience problems with some apps and scripting languages/systems which do not handle normalisation ‘properly’ – in other words, assume that the file system will perform the normalisation for them.

It may be that, with native and runtime normalisation incorporated into iOS, Apple decides to introduce either or both into case-insensitive APFS. That should ensure that ill-behaved and older apps work more robustly with filenames incorporating accents and other characters which normalise.

Have normalisation problems gone away?

Unfortunately, problems with normalisation, and with Unicode’s thousands of ‘confusable’ characters/forms/glyphs, have not vanished, nor will they, no matter what Apple does with APFS. But the more normalisation that APFS does, the less likely you are to encounter them.

Users who should, perhaps, be most concerned are those who use their own scripts to work with file and folder names, and perform string comparisons in them. For the time being, case-insensitive APFS will store normalisable folder and file names as it gets them, in whatever form, and will therefore hand those names back in that same form (even if it might perform normalisation checks when generating hashes for its own use). If your script assumes that the names that it gets from the file system have been normalised, then it may well suffer bugs when run on an APFS volume.
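One defensive approach for such scripts is to stop assuming anything about the form in which names arrive, and to normalise both sides before comparing. The Python sketch below does that for a folder listing. The function names match_names and find_in_folder are hypothetical, and the choice of NFC is an assumption: any of the standard forms will do, so long as both sides of the comparison use the same one.

```python
import os
import unicodedata


def match_names(names, wanted, form="NFC"):
    """Return the names which equal wanted once both are normalised.

    The names are returned exactly as supplied, so they can still be
    used to open the files as they are stored on disk.
    """
    target = unicodedata.normalize(form, wanted)
    return [n for n in names if unicodedata.normalize(form, n) == target]


def find_in_folder(folder, wanted, form="NFC"):
    """Apply match_names to whatever names the file system reports."""
    return match_names(os.listdir(folder), wanted, form)


# A listing as it might come back from a volume which does not normalise:
# the first name is stored decomposed, the second composed.
listing = ["re\u0301sume\u0301.txt", "notes.txt", "caf\u00e9.txt"]

hits = match_names(listing, "r\u00e9sum\u00e9.txt")
naive_hits = [n for n in listing if n == "r\u00e9sum\u00e9.txt"]
```

The naive comparison finds nothing, while match_names returns the decomposed name as stored.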

String operations which may be sensitive to normalisation include all those involving the comparison of characters, as used in search, compare, and sort, as I have discussed previously. The effects which you may encounter include:

  • searches failing to find strings which you think should match, but which differ in normalisation,
  • strings being reported as being unequal on comparison, when they differ in normalisation,
  • odd sort orders when strings contain normalisable characters.

Some of these can be very difficult to detect.
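The effects listed above can be reproduced in a few lines of Python. This is only a sketch using made-up sample strings, with nfc as my own shorthand helper. It shows a search and a comparison which fail until both sides are normalised, and a sort whose order changes once names are compared in a single form.

```python
import unicodedata


def nfc(text):
    """Return text in Normalisation Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# Searching: the needle is composed, the haystack is decomposed.
haystack = "cafe\u0301 menu"
needle = "caf\u00e9"
raw_found = needle in haystack
normalised_found = nfc(needle) in nfc(haystack)

# Comparing: the same name in two encodings.
raw_equal = "caf\u00e9" == "cafe\u0301"
normalised_equal = nfc("caf\u00e9") == nfc("cafe\u0301")

# Sorting: one name is decomposed, one composed.
names = ["zebra", "e\u0301clair", "\u00e9cole", "eclipse"]
raw_sorted = sorted(names)
normalised_sorted = sorted(names, key=nfc)
```

In the raw sort, the two names beginning with é are separated by zebra, because one starts with a plain e and the other with U+00E9. Sorting on the normalised key brings them together, although this is still ordering by code point, not the order which a reader of any particular language would expect.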

To help you identify and address problems which may arise with High Sierra, I have two free tools available from Downloads above: Apfelstrudel, which lets you see different normalisation forms and their effects on some string operations, and the command tool unorml, which will normalise strings according to any of the four Unicode standard forms. The latter can be used if your scripting language/system/app does not support normalisation directly.
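For those whose scripting language does support normalisation, the four standard forms are NFC, NFD, NFKC and NFKD. The Python sketch below, which says nothing about how Apfelstrudel or unorml work internally, converts one sample string to each form and lists the resulting code points. The helpers all_forms and code_points are names chosen for this example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(text):
    """Return a dict mapping each standard form to the normalised text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join("U+{:04X}".format(ord(ch)) for ch in text)


# Sample: the fi ligature (U+FB01) followed by "anc" and a composed é.
sample = "\ufb01anc\u00e9"
forms = all_forms(sample)
lengths = {form: len(text) for form, text in forms.items()}

for form in FORMS:
    print(form, code_points(forms[form]))
```

The canonical forms (NFC and NFD) leave the ligature alone and only compose or decompose the accent, while the compatibility forms (NFKC and NFKD) also replace the ligature with separate f and i.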

As experience with pre-release versions of High Sierra grows, we’ll no doubt hear more about these issues, and I will report them further.