Beyond Scripting in Swift: Of characters and closures

The task of Dystextia would have been very simple in the past: look through a long string of byte-sized characters, replacing them according to a look-up table. With Unicode, you cannot do that directly, as each character can occupy a variable number of bytes, and in many cases here we’re going to replace a single-byte character with three bytes. This is not something to try with a neat old C hack, but must be performed carefully through the right classes and calls.
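To make the variable-width point concrete, here is a minimal sketch that measures how many UTF-8 bytes a single Character occupies. The helper utf8Length and the three sample characters are my own choices for illustration, not part of Dystextia.

```swift
// Each constant below is one Character, but their UTF-8 encodings differ in length.
let latinA: Character = "a"            // U+0061
let cyrillicA: Character = "\u{0430}"  // U+0430, Cyrillic small letter a
let euro: Character = "\u{20AC}"       // U+20AC, euro sign

// Hypothetical helper: number of UTF-8 bytes needed to store one Character.
func utf8Length(_ c: Character) -> Int {
    return String(c).utf8.count
}

print(utf8Length(latinA), utf8Length(cyrillicA), utf8Length(euro))
```

A byte-for-byte table lookup cannot cope with this, as the replacement may be longer than the character it replaces.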

At the same time, performance is going to be important. One of my test files is over 5 MB of XML, and an inefficient solution is going to show up quickly on that. Having spent some time looking carefully at the best character substitutions, I have come up with a list of 24 to be made, and that is likely to change as the app develops.

The quick and easy solution might be to use the String method replacingOccurrences(of:with:), which Foundation provides, for example
let aString = "This is a string"
let newString = aString.replacingOccurrences(of: "a", with: "а")

There are several problems with that. First, it works with strings and not characters, so unless it has been optimised for characters as well, it is going to carry a lot of overhead. But most obviously it can only perform one substitution at a time: working through all 24 will mean iterating through the entire text string 24 times, testing every character on each pass.
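To show what that multi-pass approach would look like, here is a rough sketch using a hypothetical three-entry table in place of the full 24. Only the a to а pair comes from this article; the other two pairs, and the function name, are assumptions for illustration.

```swift
import Foundation

// Hypothetical subset of the substitution table (Latin -> Cyrillic look-alikes).
let substitutions: [(regular: String, mangled: String)] = [
    ("a", "\u{0430}"),
    ("e", "\u{0435}"),
    ("o", "\u{043E}")
]

// One full pass over the text for every entry in the table.
func encodeByRepeatedReplacement(_ text: String) -> String {
    var result = text
    for pair in substitutions {
        result = result.replacingOccurrences(of: pair.regular, with: pair.mangled)
    }
    return result
}

print(encodeByRepeatedReplacement("a tone"))
```

With 24 entries the loop body runs 24 times, each time scanning the whole string and building a new one, which is the overhead I want to avoid.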

Thankfully Swift provides a neat way to perform character-based operations on a String, through String.characters, and within the resulting CharacterView there is the function map(_:), which applies a closure to each character in turn. The closure is passed one character and returns the character to be used in its place; map(_:) collects those results into an array of characters, from which a new string can be built. (From Swift 4 onwards, String is itself a collection of Characters, so map(_:) can be called on the string directly and the characters view is no longer needed.)

There’s a very idiomatic example of its use in
if !aString.isEmpty {
    // Replace every aReg with aMang, leaving all other characters alone
    newString = String(aString.characters.map { $0 == aReg ? aMang : $0 })
}

which compares each character, referenced as $0, with the character aReg; if they are equal, the character aMang is returned in its place, otherwise the original character is passed through unchanged. The resulting characters are then assembled into the new string newString.
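Here is the same idea as a self-contained sketch that should run as it stands in current Swift, where map(_:) is called on the string directly. The sample text is my own, and I have written the Cyrillic а as an escape so that it cannot be mistaken for the Latin letter.

```swift
let aReg = Character("a")
let aMang = Character("\u{0430}")  // Cyrillic small letter a

let aString = "a banana"

// map(_:) returns [Character]; String(_:) rebuilds a string from that array.
let newString = String(aString.map { $0 == aReg ? aMang : $0 })

// Same number of characters, but more bytes once encoded as UTF-8.
print(aString.count, newString.count)
print(aString.utf8.count, newString.utf8.count)
```

The character counts match because each substitution is one Character for another, even though the encoded length grows.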

For Dystextia, we need to perform 24 such comparisons for each character, with 24 possible substitutions, which is best performed in a separate function called from within the closure as
if !aString.isEmpty {
    // Pass every character through encodeChar() and rebuild the string
    newString = String(aString.characters.map { encodeChar(inC: $0) })
}

There are several different ways that we could implement the function encodeChar() and its lookup table, such as with an enumeration and switch. I am wary of the efficiency of those: each character has a very large number of possible values, the frequency of characters varies widely (from e to z), and an enumeration starting at 24 values is already fairly large. Although these other solutions can be more concise to code, for the time being I am going to work with simple individual constants and an if {} else if {} cascade.
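For comparison, here is a sketch of two of those more structured alternatives: a switch over the character, and a dictionary used as the lookup table. Neither is what Dystextia uses. The function names and the e and o pairs are assumptions; only the a pair appears in this article.

```swift
// Alternative 1: a switch on the Character itself.
func encodeCharSwitch(inC: Character) -> Character {
    switch inC {
    case "a": return "\u{0430}"  // Latin a -> Cyrillic а
    case "e": return "\u{0435}"  // assumed pair
    case "o": return "\u{043E}"  // assumed pair
    default:  return inC         // everything else passes through
    }
}

// Alternative 2: a dictionary as the lookup table.
let encodeTable: [Character: Character] = [
    "a": "\u{0430}",
    "e": "\u{0435}",
    "o": "\u{043E}"
]

func encodeCharTable(inC: Character) -> Character {
    // A missing key means no substitution is defined for this character.
    return encodeTable[inC] ?? inC
}

print(String("a tone".map { encodeCharSwitch(inC: $0) }))
```

Both are shorter to write than a cascade; whether either is faster on a 5 MB file is something only measurement can settle.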

So the characters are declared as individual constants, in matching pairs
let aReg = Character("a")
let aMang = Character("а")

for the lower-case letter a, transforming from the regular form a to the encoded (mangled) form а.

[Image: dystextia10]

Then the function to encode runs from
func encodeChar(inC: Character) -> Character {
    if inC == aReg { return aMang }
    else if inC == aCapReg { return aCapMang }

down to
    else if inC == yCapReg { return yCapMang }
    else { return inC }
}
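Cut down to just two pairs, the whole arrangement can be sketched as a complete program. The pairing of capital A with the Cyrillic А (U+0410) is my assumption for this example, as is the wrapper function encode(_:); the real cascade has 24 branches.

```swift
let aReg = Character("a")
let aMang = Character("\u{0430}")     // Cyrillic small letter a
let aCapReg = Character("A")
let aCapMang = Character("\u{0410}")  // Cyrillic capital letter A (assumed pairing)

// Two-branch version of the cascade; unmatched characters are returned unchanged.
func encodeChar(inC: Character) -> Character {
    if inC == aReg { return aMang }
    else if inC == aCapReg { return aCapMang }
    else { return inC }
}

// Hypothetical wrapper: run every character of the text through encodeChar().
func encode(_ text: String) -> String {
    return String(text.map { encodeChar(inC: $0) })
}

print(encode("A cat sat"))
```

An empty string simply maps to an empty string, so the wrapper needs no special case for it.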

[Image: dystextia11]

The performance of this cascade can be tuned by arranging these in order of the frequency of letters, to minimise the number of comparisons required for each character. Tuning will also need to compare this plain linear solution against more structured alternatives such as enumeration and switch.
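One way to decide the order of the cascade is to count how often each candidate character occurs in a representative sample and sort on that. This is only a sketch: the function frequencyOrder(of:in:) and the sample text are hypothetical, and a real run would use one of the large test files.

```swift
// Returns the candidate characters found in the sample, most frequent first.
// Ties are broken by character order so that the result is deterministic.
func frequencyOrder(of candidates: Set<Character>, in sample: String) -> [Character] {
    var counts: [Character: Int] = [:]
    for ch in sample where candidates.contains(ch) {
        counts[ch, default: 0] += 1
    }
    return counts
        .sorted { $0.value != $1.value ? $0.value > $1.value : $0.key < $1.key }
        .map { $0.key }
}

print(frequencyOrder(of: ["a", "e", "y"], in: "every year a tree"))
```

The tests at the top of the cascade would then follow the order of the returned array, so that the commonest characters are matched after the fewest comparisons.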

That accounts for the action function for the Uniencode button; the Unidecode button simply performs the substitutions in reverse. Because I know that some will suggest that these alternative Unicode code points are already covered by normalisation forms, I also built in a button to perform normalisation on the text. This simply calls the String function decomposedStringWithCompatibilityMapping, which returns the normalised string using Form KD, which is the most extensive of the four normalisation forms.
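As a quick check of what Form KD does and does not touch, this sketch normalises two sample strings and lists the resulting code points. The fi ligature is my own choice of a character that does have a compatibility mapping, and the helper scalars(_:) is hypothetical.

```swift
import Foundation

let cyrillicA = "\u{0430}"   // Cyrillic small letter a
let ligatureFi = "\u{FB01}"  // Latin small ligature fi

let normalisedA = cyrillicA.decomposedStringWithCompatibilityMapping
let normalisedFi = ligatureFi.decomposedStringWithCompatibilityMapping

// Hypothetical helper: the code points of a string, for exact comparison.
func scalars(_ s: String) -> [UInt32] {
    return s.unicodeScalars.map { $0.value }
}

print(scalars(normalisedA))   // code points after normalising the Cyrillic letter
print(scalars(normalisedFi))  // code points after normalising the ligature
```

The ligature is broken down into the plain letters f and i, but the Cyrillic а comes back unchanged, so for that character at least normalisation does not undo the substitution.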

The action functions then stuff the new strings back into the text scroller, as below.

[Image: dystextia12]

There was one significant bug which made itself known after I thought that I had tested version 1.0 thoroughly: the app used the new File menu commands, and the Duplicate command caused it to quit unexpectedly. This turned out to be the result of leaving the NSDocument override which allows it to save in place (autosavesInPlace) returning the Boolean true. For version 1.1, all I had to do was set that to return false, and Dystextia now features traditional commands in the File menu, and doesn’t crash any more. I hope.