How to normalise strings, and a new command tool to help

One way to work around potential problems with the normalisation of Unicode strings – as are likely to occur when working with APFS volumes – is to take charge of the process yourself.

Let’s say that you want to search a list of filenames for a particular group of Unicode characters, and you are not able to call a provably normalisation-insensitive search routine. One answer is to normalise all the filenames and the search string yourself, before performing the search. This is effectively what HFS+ does for us now, but APFS itself will (apparently) not do so in the future.
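As a minimal sketch of that approach, here is how it might look in Python using the standard unicodedata module. The function name normalised_search and the sample filenames are hypothetical, chosen only for illustration. I have assumed NFC as the common form, so that a search for plain 'cafe' does not also match names containing 'café'; any of the four forms should serve, provided both sides use the same one.

```python
import unicodedata

def normalised_search(names, needle, form="NFC"):
    """Return the names containing needle, comparing both in one normalisation form."""
    target = unicodedata.normalize(form, needle)
    return [n for n in names if target in unicodedata.normalize(form, n)]

# The same visible name stored precomposed (NFC) and decomposed (NFD), plus a control.
names = ["caf\u00e9.txt", "cafe\u0301.txt", "tea.txt"]

# A plain substring test only finds the name stored in the same form as the needle.
naive = [n for n in names if "caf\u00e9" in n]

# Normalising both sides first finds both spellings of the name.
matches = normalised_search(names, "caf\u00e9")

print(len(naive), len(matches))
```

Normalising every name on every search is wasteful for long lists; in practice you would probably normalise the names once and keep the results.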

If you are coding in Python, its standard library includes the unicodedata module, whose function unicodedata.normalize(form, unistr) supports the forms NFC, NFD, NFKC, and NFKD.
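To see how the four forms differ, this short sketch normalises one sample string each way and counts the resulting code points. The sample is my own choice: it combines the 'ﬁ' ligature (U+FB01), which only the compatibility forms split into 'f' and 'i', with a precomposed 'é' (U+00E9), which only the decomposed forms split into 'e' plus a combining accent.

```python
import unicodedata

# "ﬁancé" written with the fi ligature (U+FB01) and a precomposed é (U+00E9).
sample = "\ufb01anc\u00e9"

forms = ("NFC", "NFD", "NFKC", "NFKD")

# Number of code points in the sample after normalising to each form.
lengths = {form: len(unicodedata.normalize(form, sample)) for form in forms}

for form in forms:
    print(form, lengths[form])
```

The string is shortest in NFC, which changes nothing here, and longest in NFKD, which splits both the ligature and the accented letter.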

iconv is a general-purpose command to convert between different string representations, which is documented in man iconv, although to see all the format options you will need to consult man 3 iconv_open or list them using iconv -l. Its macOS port includes a format option UTF-8-MAC or UTF8-MAC which performs normalisation to NFD, although I cannot see any options which support the three other normalisation forms.

There is a Perl version of iconv in piconv, although I don’t know whether that supports NFD normalisation. If you are working in Perl, a better choice might be Unicode::Normalize, called as
NFD($string) or
normalize('D', $string)
to return the NFD normalised form. This supports C, D, KC, KD forms.

uconv, the command tool supplied with the large and long-established ICU library, reads from files or standard input, and supports commands such as
echo 'string' | uconv -x any-nfd
and
echo 'string' | uconv -x any-nfc
which may be a help.

If you are programming in one of the languages supported by Xcode, including Objective-C and Swift in particular, you should have access to the Foundation string type NSString, which includes the four properties
NSString.precomposedStringWithCanonicalMapping to return form C (NFC)
NSString.decomposedStringWithCanonicalMapping to return form D (NFD)
NSString.precomposedStringWithCompatibilityMapping to return form KC (NFKC)
NSString.decomposedStringWithCompatibilityMapping to return form KD (NFKD).

There are many other circumstances in which you might want a convenient command tool, so I have created unorml, which is available in Downloads above.

unorml simply returns the Unicode normalised version of the string which you provide as input, using the four options -c, -d, -kc, or -kd for NFC, NFD, NFKC, and NFKD respectively. So
unorml -d 'café'
will return the D form
café
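Because the composed and decomposed forms normally look identical on screen, it isn't obvious from the output alone that anything has changed. One way to check is to list the code points in a string. The following is a sketch using Python's standard unicodedata module, in which the helper name describe is hypothetical, and the decomposed string is generated in Python rather than taken from the tool's output.

```python
import unicodedata

def describe(text):
    """List each code point in text with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]

composed = "caf\u00e9"                                # form C: é is one code point
decomposed = unicodedata.normalize("NFD", composed)   # form D: e + combining accent

# The strings render alike, but compare as different and differ in UTF-8 length.
same = composed == decomposed
sizes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

print(same, sizes)
for line in describe(decomposed):
    print(line)
```

The last line printed names the combining acute accent which follows the plain 'e' in form D.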

It comes with its Swift 3.1 source, the code-signed command tool, a ReadMe.txt file, and a signed Installer package ready to install the tool in /usr/local/bin.

Its source is very simple, and listed in the two formatted sections below.

[Image: unorml01]

[Image: unorml02]

The business end is simply:
// At least an option and a string must follow the command name.
let argCount = CommandLine.argc
if (argCount > 2) {
    let first = CommandLine.arguments[1]
    let second = CommandLine.arguments[2]
    // Drop the first character of the option (the leading "-") before parsing it.
    let (option, value) = getOption(first.substring(from: first.characters.index(first.startIndex, offsetBy: 1)))
    // Bridge to NSString to reach its normalisation properties.
    let theStr = second as NSString
    switch option {
    case .formc:
        print(theStr.precomposedStringWithCanonicalMapping)
    case .formd:
        print(theStr.decomposedStringWithCanonicalMapping)
    case .formkc:
        print(theStr.precomposedStringWithCompatibilityMapping)
    case .formkd:
        print(theStr.decomposedStringWithCompatibilityMapping)
    case .help:
        printUsage()
    case .unknown:
        fputs("unorml: unknown option\n", stderr)
        printUsage()
    }
} else {
    printUsage()
}
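For those who don't use Swift, much the same dispatch can be sketched in Python. This is only a rough analogue and not part of unorml: the names FORMS, normalise_with_option and main are hypothetical, it omits the help option, and it relies on unicodedata.normalize rather than NSString.

```python
import sys
import unicodedata

# Maps the four option names used by unorml to normalisation form names.
FORMS = {"-c": "NFC", "-d": "NFD", "-kc": "NFKC", "-kd": "NFKD"}

def normalise_with_option(option, text):
    """Return text normalised as the option asks, or None for an unknown option."""
    form = FORMS.get(option)
    if form is None:
        return None
    return unicodedata.normalize(form, text)

def main(argv):
    """Mimic the command: argv is [program, option, string]. Returns an exit status."""
    if len(argv) < 3:
        print("usage: -c|-d|-kc|-kd string", file=sys.stderr)
        return 1
    result = normalise_with_option(argv[1], argv[2])
    if result is None:
        print("unknown option", file=sys.stderr)
        return 1
    print(result)
    return 0

# Example call with an explicit argument list, in place of sys.argv.
status = main(["sketch", "-d", "caf\u00e9"])
```

To run it as a real command you would pass sys.argv to main and hand its result to sys.exit.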

This allows you to normalise strings from anywhere that you can run a command, such as inside apps like Tinderbox and Storyspace, via AppleScript, or in Terminal itself.

I hope that proves useful to you.