Using Unicode better

We all like the benefits of Unicode, such as its vast range of expressive emoji, its support for a huge number of writing systems, and the ability to mix text using different writing systems. But we sometimes stumble over its snags and shortcomings: problems with the normalisation of file names and Apple’s new file system, APFS, spring to mind.

Most accounts of Unicode strike me as being rose-tinted and largely theoretical. When it comes down to important practical issues, such as how to cope with the many different code points which are visually indistinguishable, they don’t offer any realistic solutions. I have found an exception, which makes valuable reading for anyone using computers – particularly Macs – with more than just the basic Latin writing system: it’s a free book, entitled The Unicode Cookbook for Linguists, by Steven Moran and Michael Cysouw, published by Language Science Press.

Don’t be put off by the inclusion of linguists in its title, nor by the fact that it’s a book. It’s less than 150 pages, and many of them will save you more time than you’ll take to read the whole text. The authors have spent much of their time recording languages using the International Phonetic Alphabet (IPA). As both Unicode and the IPA are evolving international standards developed by different groups for contrasting purposes, you can imagine how challenging that must be.

It may seem that Unicode, now with over 137,000 different code points, should be able to include everything within the IPA, but because the two organisations have moved in different directions, and differ over the concept of what a ‘character’ might represent, there are many mismatches, such as corresponding characters in the two standards being given different names.

Among the Unicode pitfalls identified are some essentials which must be grasped by anyone working beyond a single, simple writing system such as basic Latin. These include the distinction between characters, glyphs, and graphemes, which is explained lucidly. The practical solution proposed to the common problem of missing glyphs, that of a fallback font, may seem obvious, but many of us arrive at it only after a lot of trial and error. The book provides links to a good range of excellent fallbacks too.

Problems with apps performing automatic font substitution are then highlighted and explained, with Apple’s AAT and SIL’s Graphite offered as solutions.

Close attention is given to my particular concerns about visually indistinguishable glyphs – homoglyphs – and multiple code points which represent the same characters. These are easily illustrated: A is LATIN CAPITAL LETTER A, U+0041, and Α is GREEK CAPITAL LETTER ALPHA, U+0391, which are homoglyphs. Å is LATIN CAPITAL LETTER A WITH RING ABOVE, U+00C5, and Å is ANGSTROM SIGN, U+212B, which are canonically equivalent and can be remapped using normalisation.
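The difference between those two cases is easy to demonstrate in a few lines of Python, using the standard unicodedata module (a minimal sketch of my own, not taken from the book):

```python
import unicodedata

# Canonical equivalents: ANGSTROM SIGN (U+212B) normalises to
# LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5) under NFC.
angstrom = "\u212B"
a_ring = "\u00C5"
print(angstrom == a_ring)                                 # False: different code points
print(unicodedata.normalize("NFC", angstrom) == a_ring)   # True: remapped by normalisation

# Homoglyphs: LATIN CAPITAL LETTER A (U+0041) and GREEK CAPITAL
# LETTER ALPHA (U+0391) look identical but are not canonically
# equivalent, so no normalisation form will fold one into the other.
latin_a = "\u0041"
alpha = "\u0391"
print(unicodedata.normalize("NFC", alpha) == latin_a)     # False: still distinct
print(unicodedata.name(latin_a), "/", unicodedata.name(alpha))
```

Run this and you can see why normalisation solves the Å problem but leaves the A/Α problem untouched.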

If you want to explore these further, I have two apps which might interest you: Apfelstrudel looks in detail at normalisation, and Dystextia shows how you can obfuscate text using homoglyphs. They are both free from Downloads above.

The authors point out that normalisation isn’t a perfect solution, as it breaks down where the Unicode standard declines to grant canonical equivalence, and this is one of the problems tackled by their system of orthography profiles, towards which the book builds. Each chapter ends with a set of practical recommendations; those for Chapter 3 could stand alone as rules for retaining sanity in an increasingly Unicode world.

For those unaccustomed to the IPA, Chapter 4 gives valuable background and leads on to discussion of the problems of using the IPA in Unicode. This concludes with a letter-by-letter proposal for the standard encoding of the IPA in Unicode, which for linguists will be one of the book’s most valuable sections.

Chapter 7 then explains why Unicode Locales are not a solution, and proposes orthography profiles, consisting of Unicode code points, characters, graphemes, and a set of rules for any given language. The authors specify this formally, and in the following chapter (and supporting source code) introduce implementations in Python and R.
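To give a flavour of the idea – and this is my own toy sketch, not the authors’ implementation, with an invented profile and sample word – an orthography profile can be treated as a list of a language’s graphemes, against which text is segmented greedily, longest match first:

```python
# Toy orthography profile for a hypothetical language in which
# 'ch' and 'aa' are single graphemes rather than character pairs.
profile = ["ch", "aa", "a", "c", "h", "n"]

def segment(text, graphemes):
    """Greedily split text into graphemes, longest match first."""
    graphemes = sorted(graphemes, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for g in graphemes:
            if text.startswith(g, i):
                out.append(g)
                i += len(g)
                break
        else:
            raise ValueError(f"no grapheme matches at position {i}: {text[i:]!r}")
    return out

print(segment("chaan", profile))  # ['ch', 'aa', 'n']
```

The real system is considerably richer, adding per-language rules and handling of Unicode’s quirks, but the core insight is the same: what counts as one grapheme depends on the orthography, not on the code points alone.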

On top of all this wonderful information, Language Science Press is being highly innovative with this publication. The book’s source, in LaTeX of course, is a project on GitHub, together with its computer source code, and will evolve as open source. You can read more about this in the publisher’s article here.

This book deserves to be much more widely read than its title might suggest. Even if you’re not particularly concerned about transcribing exotic languages using the IPA, it provides many excellent tips which anyone working with Unicode should study.

It also raises the much bigger question as to why, when we have such a vast and rich collection of ‘characters’, the only tool that most of us use is Apple’s “Emoji & Symbols” (sic) panel, which struggles to make emoji accessible, and largely fails when you want to do anything more serious with language.