Normalising strings in Swift: scripting Apfelstrudel and beyond

When I came across potential issues over normalisation of Unicode characters, the most frustrating aspect was being unable to look at the problem without faffing around for ages pasting individual characters into the Emoji & Symbols panel. It was a bit like radiation – we know the dangers, can’t see it, and can only see its effects after time.

Having failed to locate a suitable tool for the task, the only option was to make myself one – Apfelstrudel. I had most of a day, but couldn’t afford to devote much more than that.

I started with my MacAppScaffold framework and a little time in a Swift playground (Xcode). The latter allowed me to work out my moves, which I then slotted into a development of the basic dialog in the scaffold interface.

The crucial calls are the four normalisation properties of NSString, which are also available on a Swift String once Foundation is imported:

  • precomposedStringWithCanonicalMapping – which returns a String normalised to Form C,
  • decomposedStringWithCanonicalMapping – which returns a String normalised to Form D; HFS+ uses a variant of this form, so it is the most important,
  • precomposedStringWithCompatibilityMapping – which returns a String normalised to Form KC, and
  • decomposedStringWithCompatibilityMapping – which returns a String normalised to Form KD.

Thankfully these are really easy to use.
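
To illustrate the four properties listed above, here is a minimal sketch. The sample strings are my own choice for illustration: an "é" built from a base letter plus combining accent, and the "ﬁ" ligature, which only changes under the compatibility forms.

```swift
import Foundation

// "é" written as a base letter plus a combining acute accent (U+0065 U+0301).
let decomposed = "e\u{301}"

let formC = decomposed.precomposedStringWithCanonicalMapping   // one scalar, U+00E9
let formD = decomposed.decomposedStringWithCanonicalMapping    // two scalars, U+0065 U+0301

// The ligature U+FB01 is left alone by Forms C and D,
// but the compatibility forms expand it to the two letters "f" and "i".
let ligature = "\u{FB01}"
let formKC = ligature.precomposedStringWithCompatibilityMapping
let formKD = ligature.decomposedStringWithCompatibilityMapping

print(formC.unicodeScalars.count, formD.unicodeScalars.count)
print(formKC.unicodeScalars.count, formKD.unicodeScalars.count)
```

Counting Unicode scalars rather than characters matters here, because Swift treats both versions of "é" as a single Character.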

The only tricky code is converting the UTF-8 bytes of each string to hexadecimal text. There are lots of different ways to tackle that, but the approach which seemed the most straightforward was to use a mapping closure. This is an ingenious shorthand for transforming the elements in an array using a specified mapping. A classic example is changing to lower case:
let lowerCase = stringArray.map { $0.lowercased() }
or you can use it to count the letters in each word of an array:
let counts = stringArray.map { $0.count }

In this case, I need to map each byte of the string's UTF-8 encoding to a string representation of its hex value, which is succinctly written as
let stringArray = theString.utf8.map { String(format: "%02hhx", $0) }

This results in an array of strings, which collapses into a single string using
let hexString = stringArray.joined(separator: " ")
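
The two lines above can be wrapped into one small function. This is only a sketch: the name hexDump is hypothetical, not something in Apfelstrudel, and the inputs use Unicode escapes so that the source file itself cannot be silently normalised.

```swift
import Foundation

// Hypothetical helper: returns the UTF-8 bytes of a string as
// space-separated, two-digit lowercase hex values.
func hexDump(_ text: String) -> String {
    return text.utf8
        .map { String(format: "%02hhx", $0) }  // one two-digit hex string per byte
        .joined(separator: " ")
}

print(hexDump("\u{E9}"))     // precomposed é
print(hexDump("e\u{301}"))   // e followed by a combining acute accent
```

The two calls print different byte sequences even though both strings display as the same character, which is exactly what makes normalisation problems so hard to see.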

For each of the four normalisation forms, I therefore run through a series of steps like:
let tStr1 = theString.precomposedStringWithCanonicalMapping
let tStr1a = tStr1.utf8.map { String(format: "%02hhx", $0) }
let tStr1b = tStr1a.joined(separator: " ")
textOut1.stringValue = tStr1b
theOutStr += "Form C: " + tStr1 + " hex: " + tStr1b + " length: " + String(tStr1.utf8.count) + "\n"

which writes the hex string to the correct text box, and formats and appends it to the string to go in the bottom scrolling box.
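
Since the same steps repeat for each form, they could also be driven from a table. The following is a sketch of that idea rather than the code in Apfelstrudel: hexDump and normalisationReport are hypothetical names, and it leaves out the text boxes so that it runs outside an app.

```swift
import Foundation

// Hypothetical helper, using the same mapping closure as in the text.
func hexDump(_ text: String) -> String {
    return text.utf8.map { String(format: "%02hhx", $0) }.joined(separator: " ")
}

// Builds one report line per normalisation form for the given input.
func normalisationReport(for input: String) -> String {
    let forms: [(name: String, value: String)] = [
        ("Form C", input.precomposedStringWithCanonicalMapping),
        ("Form D", input.decomposedStringWithCanonicalMapping),
        ("Form KC", input.precomposedStringWithCompatibilityMapping),
        ("Form KD", input.decomposedStringWithCompatibilityMapping),
    ]
    var report = ""
    for form in forms {
        // Same layout as the lines appended to theOutStr in the text.
        report += "\(form.name): \(form.value) hex: \(hexDump(form.value)) length: \(form.value.utf8.count)\n"
    }
    return report
}

print(normalisationReport(for: "caf\u{E9}"), terminator: "")
```

The trade-off is that a loop like this cannot write to four separately named text boxes without also keeping them in an array, which may be why spelling out each form is simpler in a small dialog.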

I also wanted to draw attention to whether normalisation using Form D would change the string, which is perhaps the most important piece of information for the user. I decided to do that by changing the background colour of the appropriate label. This required giving that static text a background, a setting in Interface Builder, then
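// tStr0b and tStr2b are the hex strings of the original and Form D strings
if tStr0b != tStr2b {
    // semi-transparent red: Form D normalisation changes the string
    textFixed1.backgroundColor = NSColor(red: 1.00, green: 0.00, blue: 0.00, alpha: 0.5)
    theOutStr += "Original and Form D strings differ. Try using the Form D normalised string:\n" + tStr2 + "\n"
} else {
    // semi-transparent green: the string is already in Form D
    textFixed1.backgroundColor = NSColor(red: 0.00, green: 1.00, blue: 0.00, alpha: 0.5)
    theOutStr += "Original and Form D strings are identical. Normalisation is not needed.\n"
}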
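
The decision itself can be separated from the interface. A minimal sketch, with formDChanges as a hypothetical name: it compares UTF-8 bytes, much as comparing the hex strings does, because Swift's == on String treats canonically equivalent strings as equal and so would never report a difference.

```swift
import Foundation

// Returns true when Form D normalisation changes the UTF-8 bytes of the input.
func formDChanges(_ input: String) -> Bool {
    let formD = input.decomposedStringWithCanonicalMapping
    // Compare bytes, not Strings: String equality ignores canonical differences.
    return !input.utf8.elementsEqual(formD.utf8)
}

print(formDChanges("caf\u{E9}"))     // input contains the precomposed é
print(formDChanges("cafe\u{301}"))   // input is already decomposed
print("caf\u{E9}" == "cafe\u{301}")  // what a plain String comparison reports
```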

The whole code is formatted better here:

[Image: apfelstrudel09]

In the end, coding and testing took less than half the day, and most of the time was spent writing the Help Book. That seems a fair division of labour to me – it’s a shame that more developers don’t invest at least equally in documentation.