One way to work around potential problems with the normalisation of Unicode strings – as are likely to occur when working with APFS volumes – is to take charge of the process yourself.
Let’s say that you want to search a list of filenames for a particular group of Unicode characters, and you are not able to call a provably normalisation-insensitive search routine. One answer is to normalise all the file names and the search string yourself, before performing the search. This is effectively what HFS+ does for us now, but APFS itself will (apparently) not do so in the future.
If you are coding in Python, its standard library has a unicodedata module with the function unicodedata.normalize(form, unistr), which supports the forms NFC, NFD, NFKC, and NFKD.
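As a minimal sketch of the approach described above, the following normalises both the file names and the search string before comparing them. The helper name normalised_search and the sample file names are made up for illustration; only unicodedata.normalize comes from the standard library.

```python
import unicodedata


def normalised_search(names, needle, form="NFD"):
    """Return the names containing needle, comparing normalised copies.

    Hypothetical helper: both sides are brought to the same form first,
    so composed and decomposed spellings match each other.
    """
    target = unicodedata.normalize(form, needle)
    return [name for name in names
            if target in unicodedata.normalize(form, name)]


# Two spellings of the same visible name: precomposed e-acute (U+00E9),
# and e followed by a combining acute accent (U+0301).
names = ["caf\u00e9.txt", "cafe\u0301.txt", "tea.txt"]

# A plain substring test only finds the spelling that matches exactly.
plain_hits = [n for n in names if "caf\u00e9" in n]

# The normalised search finds both spellings.
normalised_hits = normalised_search(names, "caf\u00e9")

print(len(plain_hits), len(normalised_hits))  # prints: 1 2
```

The function returns the original, unmodified names, so normalisation is used only for the comparison and nothing on disk needs to change.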
iconv is a general-purpose command to convert between different string representations, which is documented in man iconv, although to see all the format options you will need to consult man 3 iconv_open or list them using iconv -l. Its macOS port includes a format option UTF-8-MAC or UTF8-MAC which performs normalisation to NFD, although I cannot see any options which support the three other normalisation forms.
There is a Perl version of iconv in piconv, although I don’t know whether that supports NFD normalisation. If you are working in Perl, a better choice might be Unicode::Normalize, called as
NFD($string)
or
normalize('D', $string)
to return the NFD normalised form. This supports C, D, KC, KD forms.
uconv, the command tool in the old and large ICU development library, supports commands such as
uconv -x any-nfd string
and
uconv -x any-nfc string
which may be a help.
If you are programming in one of the languages supported by Xcode, including Objective-C and Swift in particular, you should have access to the string type NSString, which includes the four explicit functions
NSString.precomposedStringWithCanonicalMapping to return form C (NFC)
NSString.decomposedStringWithCanonicalMapping to return form D (NFD)
NSString.precomposedStringWithCompatibilityMapping to return form KC (NFKC)
NSString.decomposedStringWithCompatibilityMapping to return form KD (NFKD).
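To see how the four forms differ in practice, here is a small sketch in Python rather than Swift, using unicodedata.normalize in place of the NSString calls. The sample string is my own choice: it contains a precomposed é, which the canonical forms treat differently, and the ligature ﬁ, which only the compatibility forms break apart.

```python
import unicodedata

# Sample with a precomposed e-acute (U+00E9) and the ligature fi (U+FB01).
sample = "caf\u00e9 \ufb01le"

# Apply each of the four normalisation forms to the same input.
forms = {form: unicodedata.normalize(form, sample)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

# Print the number of code points in each result.
for form, text in forms.items():
    print(form, len(text))
# prints:
# NFC 8
# NFD 9
# NFKC 9
# NFKD 10
```

The canonical forms (C and D) leave the ligature alone and differ only in whether the accent is combined, while the compatibility forms (KC and KD) also replace the ligature with the separate letters f and i.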
There are many other circumstances in which you might want a convenient command tool, so I have created unorml, which is available in Downloads above.
unorml simply returns the Unicode normalised version of the string which you provide as input, using the four options -c, -d, -kc, or -kd for NFC, NFD, NFKC, and NFKD respectively. So
unorml -d 'café'
will return the D form
café
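Because the D form normally looks identical to the input on screen, it can help to inspect the code points. The following is a minimal sketch in Python which doesn't call unorml itself but performs the same NFD conversion with unicodedata.normalize; code_points is a hypothetical helper name.

```python
import unicodedata


def code_points(text):
    """List each code point in text as U+XXXX (hypothetical helper)."""
    return ["U+%04X" % ord(ch) for ch in text]


# Start from the precomposed spelling, then convert it to form D.
composed = "caf\u00e9"
decomposed = unicodedata.normalize("NFD", composed)

print(code_points(composed))
# prints: ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
print(code_points(decomposed))
# prints: ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
print(composed == decomposed)
# prints: False
```

The two strings display the same way but compare as unequal, which is exactly the situation that normalising both sides before a comparison is meant to avoid.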
It comes with its Swift 3.1 source, the code-signed command tool, a ReadMe.txt file, and a signed Installer package ready to install the tool in /usr/local/bin.
Its source is very simple, and listed in the two formatted sections below.
The business end is simply:
// getOption() and printUsage() are helpers which are not shown in this excerpt.
let argCount = CommandLine.argc
// Expect at least an option and a string after the command name.
if (argCount > 2) {
    let first = CommandLine.arguments[1]
    let second = CommandLine.arguments[2]
    // Skip the first character of the option (normally "-") before parsing it.
    let (option, value) = getOption(first.substring(from: first.characters.index(first.startIndex, offsetBy: 1)))
    // Bridge to NSString to reach its normalisation properties.
    let theStr = second as NSString
    switch option {
    case .formc:
        print(theStr.precomposedStringWithCanonicalMapping)
    case .formd:
        print(theStr.decomposedStringWithCanonicalMapping)
    case .formkc:
        print(theStr.precomposedStringWithCompatibilityMapping)
    case .formkd:
        print(theStr.decomposedStringWithCompatibilityMapping)
    case .help:
        printUsage()
    case .unknown:
        fputs("unorml: unknown option\n", stderr)
        printUsage()
    }
} else {
    printUsage()
}
This allows you to normalise strings from anywhere that you can run a command, such as inside apps like Tinderbox and Storyspace, via AppleScript, or in Terminal itself.
I hope that proves useful to you.