Fun and confuddling with Unicode

We take far too many achievements for granted. Today’s unsung hero is Unicode, probably the biggest advance in the representation and dissemination of writing since the invention of printing with movable type.

For anyone who uses characters beyond the minimal Latin set supported by ASCII, the advent of Unicode on their computer brought a revolution. No longer were they plagued by mojibake, where documents opened as gibberish because they were read using the wrong code page, and at last they could mix and match multiple writing systems in the same sentence. For the great majority in the West, who seldom if ever stray beyond ASCII, this passed almost unnoticed, but not for the many rich cultures around the world. That’s far more important for mankind than ever-increasing support for emoji.

Like all great things, though, Unicode has its quirks and oddities, features which a couple of my free utilities help you explore, use, and even exploit. This article introduces what you can do with Dystextia and Apfelstrudel.

Unicode is rich, almost too rich, in the code points or characters which it supports. Browse through some of those and you’ll notice many of them appear almost identical despite being given distinct codes. With judicious choice of characters, this enables you to create text which appears perfectly readable, but in fact uses non-standard characters to substitute for normal Latin ones. Convert text into this substituted form and it defeats normal search/find techniques.
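To make the idea concrete, here's a minimal sketch in Python of this kind of substitution. The table `LOOKALIKES` and the functions `recode` and `restore` are hypothetical names invented for illustration, and the handful of Cyrillic lookalikes is only an assumption about what a substitution set might contain; it isn't how Dystextia itself works.

```python
# Hypothetical substitution table: a few Latin letters mapped to
# Cyrillic code points that most fonts draw almost identically.
LOOKALIKES = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}
# Reverse table, used to return text to its original Latin encoding.
REVERSE = {v: k for k, v in LOOKALIKES.items()}


def recode(text):
    """Swap each Latin letter in the table for its lookalike."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def restore(text):
    """Swap each lookalike back to its Latin original."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


plain = "representation and dissemination of writing"
disguised = recode(plain)

print(disguised)                     # looks the same in most fonts
print("dissemination" in plain)      # True
print("dissemination" in disguised)  # False
print(restore(disguised) == plain)   # True
```

The search fails because it compares code points, not the shapes drawn on screen.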

dystext01

Here I’ve taken the first paragraphs of this article in standard Latin encoding, and can find words like dissemination as you’d expect.

dystext02

Click on the Uniencode button, and what you see looks essentially identical, but most of the characters used aren’t from standard Latin code points, so searching is now completely broken. You can return the text to its original Latin encoding using the Unidecode button.

dystext03

Coding changes are still more radical when you check the Maximum box then click on Uniencode. This visibly alters the characters shown, but it remains completely readable. Some government agencies now use ‘fuzzy’ text search techniques which can still find words in gently recoded text, but this should make even those techniques sweat hard to discover underlying content which you can still read clearly.

Another application of this recoding is to explore Unicode spoofing, which allows you to have two files or folders side by side with names which appear identical, but are in fact composed of different characters which just happen to look the same. This can also be applied to URLs, and abused in exploits. One caution with using recoded text is its effect on those using screen readers to hear what’s in text: those readers will stumble badly over recoded text and it will become unintelligible when read out.
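A rough way to spot this sort of spoofing is to look at which scripts a name draws on. The sketch below is a heuristic only: it assumes that the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK and so on) is a good enough stand-in for its script, which isn't the same as the official Script property. The function `scripts_used` and both file names are made up for the example.

```python
import unicodedata


def scripts_used(name):
    """Return the first word of each letter's Unicode name, as a set."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # Names begin with the script, e.g. "LATIN SMALL LETTER N".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


genuine = "notes.txt"
spoofed = "n\u043etes.txt"  # the o is CYRILLIC SMALL LETTER O

print(genuine == spoofed)             # False
print(sorted(scripts_used(genuine)))  # ['LATIN']
print(sorted(scripts_used(spoofed)))  # ['CYRILLIC', 'LATIN']
```

A name which mixes scripts isn't necessarily malicious, but it's a reasonable cue to look more closely.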

One way that Unicode tries to tackle the problem of apparently identical characters with different code points is the technique of normalisation, which is the concern of the other app, Apfelstrudel. This is best-known in some accented Latin characters, but also occurs widely in some other writing systems such as Korean.

Unicode has ended up with two (and occasionally more) ways of encoding some characters such as the accented letter e: the é in café can be represented in UTF-8 either as c3 a9, the single code point U+00E9, or as 65 cc 81, a plain e (U+0065) followed by the combining acute accent U+0301. The two look identical but are different encodings. Some file systems such as HFS+ consider this a problem, so convert names to a common or normalised form. With a choice of two alternatives, as you can imagine there are two different and conflicting systems, known as Normalisation Forms C and D, and different file systems have opted for each.
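You can see the two encodings for yourself in Python, whose standard `unicodedata` module converts between the forms. This is just a sketch to show the bytes involved; the variable names are mine, and the Korean syllable is an arbitrary example of the same effect in another writing system.

```python
import unicodedata

composed = "caf\u00e9"     # é as the single code point U+00E9
decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# The UTF-8 bytes differ only in how the final letter is encoded.
print(composed.encode("utf-8").hex(" "))    # 63 61 66 c3 a9
print(decomposed.encode("utf-8").hex(" "))  # 63 61 66 65 cc 81

# A plain comparison works on code points, so the two don't match.
print(composed == decomposed)  # False

# Form D decomposes, Form C composes.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# One precomposed Hangul syllable becomes three jamo in Form D.
hangul = "\ud55c"
print(len(hangul), len(unicodedata.normalize("NFD", hangul)))  # 1 3
```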

This has caused many problems in the past. On Macs, HFS+ uses Form D, so if you enter café at the keyboard with its final letter encoded as c3 a9, it will be converted to 65 cc 81. On Linux, Form C is more common, so if you exchange files between Mac and Linux systems their names will become renormalised, and differ.

When Apple first announced details of its new file system APFS, it decided that it would have nothing to do with normalisation, and just store unnormalised names. Although many of us warned of the problems that would result, particularly when mixed with HFS+ on the same system, the first version of APFS on iOS didn’t normalise, which led to chaos. Since then, APFS has become compatible with normalisation, in particular that used by HFS+.

Normalisation problems haven’t gone away, though. There are still deep-seated problems with tasks like comparing strings. If your code is looking for the word café, wouldn’t you want it to find both normalised forms, not just one? You may be surprised to learn that some code comparisons, such as Swift’s NSString.isEqual(), don’t work properly across normalised forms, but others such as == do. Discovering whether one string is contained within another shouldn’t vary either, yet results can differ according to the version of the SDK which you build against.
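The usual defence is to normalise both sides before comparing or searching. Here's a minimal sketch in Python rather than Swift; Python's own `==` and `in` compare code points, so they show the failure clearly. The helpers `same_text` and `contains` are hypothetical names, and choosing Form C as the common form is an arbitrary assumption, as Form D would serve equally well.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after bringing both to Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def contains(haystack, needle):
    """Search for needle in haystack with both brought to Form C."""
    return unicodedata.normalize("NFC", needle) in unicodedata.normalize(
        "NFC", haystack
    )


form_c = "caf\u00e9"   # composed é
form_d = "cafe\u0301"  # e plus combining acute accent
sentence = "Meet me at the " + form_d + " at noon"

print(form_c == form_d)            # False
print(same_text(form_c, form_d))   # True
print(form_c in sentence)          # False
print(contains(sentence, form_c))  # True
```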

apfelstrudel04

Apfelstrudel is the only app that I know of which lets you explore the anomalies and problems of Unicode normalisation. If you work with text manipulation it’s an essential tool.

Dystextia and Apfelstrudel are available from their Product Page in shiny new versions which run native on both Intel and Apple Silicon Macs. Have fun!
