Sort order, collation and the Finder

Have you noticed that the order in which items are listed in a Finder window is different from that used by the ls command in Terminal? For example, the Finder lists 20test.text before 200test.text, while ls lists them in reverse. You can see more differences in a longer listing of a test folder.

In the Finder, running with English (UK) as primary language and List sort order set to Universal:
01test.text
2test.text
02test.text
3test.text
20test.text
200test.text
Atest.text
åtest.text
atest2.text
átest2.text
åtest2.text
atest3.text
atest12.text
atest20.text
btest.text

Using ls in Terminal, that becomes:
01test.text
02test.text
200test.text
20test.text
2test.text
3test.text
åtest.text
Atest.text
atest12.text
átest2.text
atest2.text
åtest2.text
atest20.text
atest3.text
btest.text

To understand the differences, I’ll consider the behaviours involved.

Numbers and ‘natural’ sort order

Perhaps the most obvious difference is how the two sort orders treat numbers. The Finder orders those according to their whole value, whether the numbers are at the start or end of the name. Thus in the Finder’s view 2 and 02 precede 3, as they’re smaller, and the highest number and last of that sequence is 200. The ls command simply compares names one character at a time, so 0 precedes 2 regardless of what digits follow it, and 200test precedes 20test because the digit 0 sorts before the letter t.
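The difference between the two treatments of numbers can be sketched in a few lines of Python. This is only an illustration, not the Finder’s actual algorithm: the natural_key function is a hypothetical helper, and its tie-break, which puts a shorter run of digits first so that 2 precedes 02, is an assumption chosen to match the listing above.

```python
import re

def natural_key(name):
    """Build a sort key that compares runs of digits by numeric value."""
    key = []
    for part in re.split(r"(\d+)", name):
        if part == "":
            continue
        if part.isdecimal():
            # Digit run: compare by value, then by length so "2" precedes "02"
            # (the length tie-break is an assumption, not a documented rule).
            key.append((0, int(part), len(part)))
        else:
            # Text run: compare case-insensitively, character by character.
            key.append((1, part.casefold(), 0))
    return key

names = ["01test.text", "02test.text", "200test.text",
         "20test.text", "2test.text", "3test.text"]

print(sorted(names))                   # plain character-by-character order
print(sorted(names, key=natural_key))  # digit runs compared by value
```

The plain sort reproduces the order given by ls for these six names, while the keyed sort reproduces the Finder’s.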

As we often embed numbers in file names, this matters. In the latter years of classic Mac OS it became a contentious issue, with campaigners like Adam Engst and Stuart Cheshire encouraging Apple in 1997 to adopt this ‘natural’ sort order. Apple did so, with full support for modern ordering available from OS X 10.6.

Case

File systems in Mac OS have traditionally been case-insensitive but case-preserving, unlike variants used by iOS. This means that files named Atest.text and atest.text cannot exist within the same directory, but wherever Atest.text goes it retains its name with an uppercase first character. Both orderings therefore disregard case, as is common practice. However, some schemes for sort ordering list uppercase before lowercase, and others do the reverse.
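As a rough model of what case-insensitive but case-preserving means, the following Python sketch keeps a directory’s entries in a dictionary keyed on a case-folded name. The CasePreservingDirectory class is hypothetical, and Python’s casefold() stands in for whatever folding rules a real file system applies, which is an assumption.

```python
class CasePreservingDirectory:
    """Toy directory: names collide regardless of case, but keep their case."""

    def __init__(self):
        self._entries = {}  # case-folded name -> name exactly as created

    def add(self, name):
        key = name.casefold()
        if key in self._entries:
            # A name differing only in case counts as the same entry.
            raise FileExistsError(f"{name} clashes with {self._entries[key]}")
        self._entries[key] = name

    def lookup(self, name):
        # Any capitalisation finds the entry; the stored form is returned.
        return self._entries.get(name.casefold())

folder = CasePreservingDirectory()
folder.add("Atest.text")
print(folder.lookup("atest.text"))  # the name as originally created
```

Trying to add atest.text to the same folder would then raise FileExistsError, just as the two names cannot coexist in one directory.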

Accents and diacritics

Different languages, and sometimes even their regional variants, treat accents and diacritics according to different rules. Most commonly, for sorting purposes accented characters are treated as having the same base character, but as shown here the order among characters sharing that base may differ. This is a complicated area, as illustrated by the Nordic letter Ø, which in Denmark and Norway is treated as distinct from O rather than an accented variant, and placed near the end of the alphabet, after Z. If you’ve tried to look a term up in a Danish book, or use a Danish phone directory, you’ll know how confusing that proves.
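One way to picture sorting on the base character first is a two-level key, sketched here in Python. This is a simplification under stated assumptions rather than the rules any real locale uses: the hypothetical accent_key function strips combining marks to get the base letters, then falls back on the decomposed name to break ties. It knows nothing of language-specific rules such as those for Ø.

```python
import unicodedata

def accent_key(name):
    """Two-level key: base letters first, accents and case as a tie-break."""
    decomposed = unicodedata.normalize("NFD", name)
    # Primary level: drop combining marks and ignore case.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Secondary level: the decomposed name itself (an assumed tie-break).
    return (base.casefold(), decomposed)

names = ["btest.text", "\u00e5test2.text", "atest2.text",
         "\u00e1test2.text", "\u00e5test.text", "Atest.text"]

for name in sorted(names, key=accent_key):
    print(name)
```

With this key, names differing only in their accents stay together, and the unaccented form comes first within each group.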

Unicode normalisation

Some Unicode characters can be formed using more than one sequence of codepoints. For example, the accented character é can be represented in UTF-8 as c3 a9 (Form C) or 65 cc 81 (Form D), and the two are identical in appearance. Although early versions of APFS for macOS ignored normalisation, it now normalises filenames just as HFS+ does. An initial normalisation step ensures the existence of two different forms doesn’t affect sort order.
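The two forms are easy to inspect using Python’s standard unicodedata module. This short sketch shows the bytes given above and how normalising to a single form makes the two strings compare as equal; the variable names are just for illustration.

```python
import unicodedata

composed = "\u00e9"     # é as a single codepoint (Form C)
decomposed = "e\u0301"  # e followed by a combining acute accent (Form D)

print(composed.encode("utf-8").hex(" "))    # c3 a9
print(decomposed.encode("utf-8").hex(" "))  # 65 cc 81

# The raw strings differ, although they look identical when displayed.
print(composed == decomposed)               # False

# After normalising both to the same form, they compare as equal.
same = (unicodedata.normalize("NFD", composed)
        == unicodedata.normalize("NFD", decomposed))
print(same)                                 # True
```

Normalising before comparison is what stops otherwise identical names from sorting differently.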

Unicode Collation Standard and macOS

What was once so simple in ASCII has become a complex set of rules that vary by language, region and practice. These have been standardised in the Unicode Collation Algorithm, which is used to determine the sort order of strings of characters. The rules appear to be embedded in a set of binary files found in ~/Library/Metadata/CoreSpotlight.

The user has limited control over the sort order used in macOS. It must be the least-used feature in Language & Region settings, where it’s only offered when there are additional languages like French included in its list of Preferred Languages. Collations are included in Foundation’s Locale, and third-party code has access to the same collation as used by the Finder through Foundation’s localizedStandardCompare().

In the last century sorting and searching were early and major topics in learning programming and computer science. Thankfully that was long before they became so complex and dependent on collation rules.