Why we can’t keep stringing along with Unicode

The world is full of fakes, and always has been. From the day that someone first noticed how similar iron pyrites (“fool’s gold”) is to gold, there have been people palming it off. It’s not just news and luxury goods, either: a little fake Adobe Flash updater can bring a payload of pain to a computer user.

Your eye may be sufficiently trained to tell iron pyrites from the real stuff, but when it comes to words, the sharpest-eyed expert can easily be fooled. Even in these days of AI, AR, and VR, character strings remain crucial on our computers: in particular, we use them to identify websites, folders, and files, as well as to hold content.

Internally, it matters not whether these words are handled as IP addresses, hash codes, or whatever. When we look at the identifiers of most (virtual) objects, we see them as strings of characters. And we trust what we think we see.

I have already explained and demonstrated how easy it is to fool people with crafted characters, whether in file and folder names, or in URLs. The problem boils down to the fact that Unicode represents many visually identical characters (or forms) using different encodings.

Although there is a strong case for providing human languages with full character sets, Unicode has been profligate. For example, K is LATIN CAPITAL LETTER K, encoded in UTF-8 as 4B, while K is the KELVIN SIGN, defined in the standard as a capital letter K, with the UTF-8 encoding E2 84 AA. There’s also a Greek Κ, GREEK CAPITAL LETTER KAPPA, which is CE 9A. There are separate number forms of Ⅼ, Ⅽ, Ⅾ, Ⅿ, and their lower-case equivalents, which are encoded quite differently from the regular ‘Latin’ letters. There’s a Greek Ϲ, a Cyrillic С, a mathematical 𝖢, and possibly a few more. Then there are accents…
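If you want to check these encodings for yourself, a few lines of Python using only the standard unicodedata module will print them; this is a quick sketch, not a tool:

```python
# Print the code point, Unicode name, and UTF-8 bytes of three
# visually near-identical capitals, using only the standard library.
import unicodedata

for ch in "K\u212A\u039A":  # LATIN CAPITAL LETTER K, KELVIN SIGN, GREEK CAPITAL LETTER KAPPA
    utf8 = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):<28}  UTF-8: {utf8}")
```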

The process of normalisation is intended to address this, but it is both incomplete and inefficient. Several readers have questioned my claim that it is incomplete: the screenshots of visually duplicated folder names in this article were all made using macOS Sierra 10.12, with its normalising file system HFS+ and Apple’s specially-designed system font. Can you tell the fake characters from the ‘real’ ones?

Of the three Ks cited above, KKΚ, only the first two normalise to the same encoding, leaving the Greek capital kappa encoded differently after normalisation. Of the five Cs, CⅭϹС𝖢, none normalises under Forms C or D, and all five retain different encodings; even the rarely-used KC and KD forms only fold the Roman numeral and the mathematical C into the Latin letter, leaving the Greek and Cyrillic characters distinct.
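Python’s standard unicodedata module makes this easy to verify; the following sketch simply counts how many distinct encodings survive each normalisation form:

```python
# Count how many distinct encodings survive each normalisation form,
# for the three Ks and five Cs discussed above.
import unicodedata

ks = "K\u212A\u039A"                    # Latin K, KELVIN SIGN, Greek kappa
cs = "C\u216D\u03F9\u0421\U0001D5A2"    # Latin C, Roman numeral C, Greek lunate sigma, Cyrillic Es, mathematical C

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    nk = len({unicodedata.normalize(form, ch) for ch in ks})
    nc = len({unicodedata.normalize(form, ch) for ch in cs})
    print(f"{form}: {nk} distinct Ks, {nc} distinct Cs")
# NFC and NFD leave 2 Ks and 5 Cs; NFKC and NFKD still leave 2 Ks and 3 Cs.
```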

Modern file systems, like Apple’s APFS, are trying to avoid performing any normalisation at all, in a bid to improve performance. It does seem crazy that a file system should be burdened with lengthy look-up tasks on every folder and file name, just to check whether it could or should be normalised to something else.
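The trade-off is easy to demonstrate. This sketch (the name café is chosen purely for illustration) creates two files whose names differ only in normalisation; what happens next depends entirely on the file system:

```python
# Create two files whose names differ only in normalisation, and see
# whether the file system treats them as one name or two.
import os
import tempfile
import unicodedata

nfc = unicodedata.normalize("NFC", "caf\u00E9")   # é precomposed: U+00E9
nfd = unicodedata.normalize("NFD", "caf\u00E9")   # e followed by combining acute U+0301
assert nfc != nfd                                  # different byte sequences, same appearance

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, nfc), "w").close()
    open(os.path.join(d, nfd), "w").close()
    # HFS+ normalises names, so this prints 1; a file system that skips
    # normalisation, such as APFS as described above or ext4, keeps
    # both names and prints 2.
    print(len(os.listdir(d)))
```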

The problems extend far beyond file systems and URLs. Software does a great deal with strings, and they often end up being compared, searched, or sorted, operations which are heavily dependent on the way in which strings are encoded. Whenever two strings can appear to a normal human being to be the same, but are encoded differently, there is scope for confusion, error, and exploitation.
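For instance, in Python, two such strings compare unequal no matter how carefully they are handled:

```python
# Two file names that render identically in most fonts but compare
# unequal: the second begins with CYRILLIC CAPITAL LETTER ES (U+0421).
import unicodedata

a = "Contract.pdf"
b = "\u0421ontract.pdf"
print(a == b)              # False
print(sorted([b, a]))      # sorts by code point, not by appearance
# Even aggressive compatibility normalisation cannot reconcile them:
print(unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b))  # still False
```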

Unicode is one of the foundations of digital culture. Without it, the loss of world languages would have accelerated greatly, and humankind would have become the poorer. But if the effect of Unicode is to turn a tower of Babel into a confusion of encodings, it has surely failed to provide a sound encoding system for language.

Nor is normalisation an answer. To normalise thoroughly enough that users are extremely unlikely to confuse any characters with different codes, a great many string operations would need to go through an even more laborious normalisation process than the patchy one performed at present.
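To give an idea of what that would entail, here is a toy sketch in Python of the ‘skeleton’ approach defined in Unicode Technical Standard #39 for detecting confusables; the two-entry table is mine, purely for illustration:

```python
# A toy sketch of the 'skeleton' algorithm from Unicode Technical
# Standard #39 (confusable detection): normalise to NFD, map each
# character through a confusables table, then normalise again. The
# two-entry table below is illustrative only; the real confusables.txt
# runs to thousands of mappings, which is exactly the laborious
# look-up described above.
import unicodedata

CONFUSABLES = {
    "\u0421": "C",   # CYRILLIC CAPITAL LETTER ES
    "\u03F9": "C",   # GREEK CAPITAL LUNATE SIGMA SYMBOL
}

def skeleton(s: str) -> str:
    s = unicodedata.normalize("NFD", s)
    s = "".join(CONFUSABLES.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)

# Two strings are confusable if their skeletons match:
print(skeleton("Contract") == skeleton("\u0421ontract"))   # True
```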

Pretending that the problem isn’t significant, or will just quietly go away, is also not an answer, unless you work in a purely English linguistic environment. With increasing use of Unicode around the world, and increasing global use of electronic devices like computers, these problems can only grow in scale. I’m not aware of any exploitation of Unicode encoding in malware, but if we leave it until there is an established security problem before even considering how to address the issue, then the vulnerability will remain open for a long time to come.

Having grown the Unicode standard from just over seven thousand characters in twenty-four scripts in Unicode 1.0.0 of 1991, to more than an eighth of a million characters in 135 scripts now (Unicode 9.0), it is time for the Unicode Consortium to map indistinguishable characters to the same encodings, so that each visually distinguishable character is represented by one, and only one, encoding.

That is a stark challenge, and one that I am sure will never even be started. But until it is, today’s minor running sores will only fester and grow.