Text, strings and Unicode

Plain text might seem simple, but no computer text is both plain and simple. Before the adoption of Unicode, there was a great deal more to text encoding than ASCII, and Unicode hasn’t turned out to be any simpler. Here I’ll refer not to text, but to the strings of characters from which non-styled text is composed.

Before Unicode

Original ASCII was only 7-bit, to reduce the cost of transmission, as that was deemed sufficient to encode all the characters needed for the unaccented Roman/Latin alphabet. Those 128 characters, including a NUL of 00, were standardised in 1963, and were soon being specialised for different alphabets and extended for particular purposes. Among those extensions is Mac OS Roman, a full 8-bit set of 256 characters that set the standard for Classic Mac OS.

At the same time that ASCII was being standardised, IBM was devising EBCDIC, its own 8-bit encoding, which became popular on its mainframes and those of other manufacturers. Fortunately, few of those using early Macs came into contact with it, or with its incompatibilities with ASCII.

In 1987 ISO standard 8859 was published in a series of parts, defining 8-bit character sets based on the ASCII original, catering for languages including ISO Latin 1 (in part 1, or ISO 8859-1), Latin/Cyrillic (8859-5), and others. Microsoft diverged to its own standards for Windows, in what’s generally known as Windows-1252 or Code Page 1252, over the period 1985-98.

Those encoding standards, and the many variants known as code pages, only determine how individual characters are encoded, not how strings of characters are formed. Classic Macs were predominantly programmed in Apple’s Object Pascal language, which uses P-strings. The first byte(s) of a P-string give its length in bytes. At first a single length byte was used, limiting strings to a maximum of 255 characters, but that proved too restrictive and was later increased.

The native string format for the C language works differently, and stores no length. Instead, C-strings are terminated with a null byte, 00. That brought its own problems, in particular that a C-string can’t contain a null character without being truncated at that point. As you can imagine, accessing a C-string as if it were a P-string, or the other way round, ensured an ample supply of bugs.

Plain Roman or Latin strings were fairly simple, though. The word Mac can be represented as the following hexadecimal bytes:

  • 4D 61 63 in ASCII and all related standards for Roman/Latin character sets
  • 4D 61 63 00 as a C-string
  • 03 4D 61 63 as a P-string.
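
To make the difference concrete, here’s a minimal Swift sketch that builds both layouts for the word Mac and reads them back; the names are mine, and neither helper reflects any historical Mac OS API.

  import Foundation

  // Byte layouts for the word "Mac"; illustrative only.
  let ascii: [UInt8] = Array("Mac".utf8)               // 4D 61 63
  let cString: [UInt8] = ascii + [0x00]                // 4D 61 63 00, NUL-terminated
  let pString: [UInt8] = [UInt8(ascii.count)] + ascii  // 03 4D 61 63, length byte first

  // Reading each back requires knowing which convention applies.
  let fromC = String(bytes: cString.prefix(while: { $0 != 0 }), encoding: .ascii)
  let fromP = String(bytes: pString.dropFirst().prefix(Int(pString[0])), encoding: .ascii)
  print(fromC ?? "", fromP ?? "")                      // Mac Mac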

You may still encounter these in Mac OS Roman, or macOSRoman, usually with the UTI of com.apple.traditional-mac-plain-text. Things got more complex when working with non-Roman alphabets, for which you may come across isoLatin2 and several standards for Japanese including iso2022JP, japaneseEUC and shiftJIS.
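
Swift still exposes these legacy encodings through String.Encoding, which makes it easy to see how the same bytes can decode quite differently; in this brief sketch, the trailing byte 0xA5 is chosen purely for illustration.

  import Foundation

  // 0xA5 is a bullet (•) in Mac OS Roman, but a yen sign (¥) in ISO Latin 1.
  let bytes = Data([0x4D, 0x61, 0x63, 0xA5])

  let asMacRoman = String(data: bytes, encoding: .macOSRoman)  // "Mac•"
  let asLatin1 = String(data: bytes, encoding: .isoLatin1)     // "Mac¥"
  let asShiftJIS = String(data: bytes, encoding: .shiftJIS)    // may decode differently, or fail
  print(asMacRoman ?? "nil", asLatin1 ?? "nil", asShiftJIS ?? "nil")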

Unicode

Proliferation of all these different character sets and code pages was clearly untenable, so a replacement universal encoding system, Unicode, was initiated in 1991 with the publication of the first volume of its standard. This cunningly mirrored ISO Latin 1 (8859-1) in its first 256 code points, to minimise conversion of most text in Europe and North America. Since that first version, with just over 7,000 characters covering 24 scripts, Unicode has grown to encompass nearly 160,000 characters in 172 scripts.

The internal structure of Unicode is complex, with code points arranged in 17 planes and many blocks. The most commonly used characters can be encoded in any of three data types:

  • one to three bytes of UTF-8
  • two bytes in UTF-16
  • four bytes in UTF-32.

The most infrequently used code points (and characters) require a maximum of four bytes, whichever of UTF-8, -16 or -32 is used. Thus, for typical text dominated by Roman/Latin characters, UTF-8 is the most space-efficient, with UTF-16 less so. Taking the word Mac as the example again, it can be represented using the following hexadecimal bytes:

  • 4D 61 63 in UTF-8, exactly the same as ASCII and its relatives
  • 00 4D 00 61 00 63 in UTF-16
  • 00 00 00 4D 00 00 00 61 00 00 00 63 in UTF-32.
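
Those byte sequences are easy to confirm in Swift. This minimal sketch uses the explicitly big-endian encodings so that its output matches the order shown above; the hex function is just for display.

  import Foundation

  func hex(_ data: Data) -> String {
      data.map { String(format: "%02X", $0) }.joined(separator: " ")
  }

  let mac = "Mac"
  print(hex(Data(mac.utf8)))                     // 4D 61 63
  print(hex(mac.data(using: .utf16BigEndian)!))  // 00 4D 00 61 00 63
  print(hex(mac.data(using: .utf32BigEndian)!))  // 00 00 00 4D 00 00 00 61 00 00 00 63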

Endianness and BOMs

Different processor architectures store the bytes within words in different orders, known as big-endian and little-endian. This affects the order of the bytes in UTF-16 and UTF-32, but not UTF-8, so each of those has two variants: UTF-16BE and UTF-16LE, and UTF-32BE and UTF-32LE. When there’s any doubt over which byte order is in use, it can be indicated with a byte-order mark (BOM), the code point U+FEFF placed at the start of the string, whose FE and FF bytes reveal the order. For UTF-16, those would be

  • FE FF 00 4D 00 61 00 63 for big-endian, or
  • FF FE 4D 00 61 00 63 00 for little-endian.

Although using a BOM is optional, it can prevent misinterpretation when a string is likely to be exchanged between platforms of different endianness. Read with the wrong byte order, a string turns into the nonsense characters known as mojibake: Roman text, for example, can come out as apparent Japanese gibberish.

Apple’s developer documentation wraps these into a list of encoding options available for Swift Strings.
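
Among those options, the plain .utf16 encoding writes a BOM in the host’s byte order when encoding, while the explicitly BE and LE encodings write none, as far as I can tell. A short sketch, reusing the same hex helper:

  import Foundation

  func hex(_ data: Data) -> String {
      data.map { String(format: "%02X", $0) }.joined(separator: " ")
  }

  let mac = "Mac"
  // On today's little-endian Macs this should start FF FE; a big-endian host would give FE FF.
  print(hex(mac.data(using: .utf16)!))              // FF FE 4D 00 61 00 63 00
  print(hex(mac.data(using: .utf16BigEndian)!))     // 00 4D 00 61 00 63, no BOM
  print(hex(mac.data(using: .utf16LittleEndian)!))  // 4D 00 61 00 63 00, no BOM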

Normalisation

There’s still one more encoding issue with Unicode: normalisation forms. These result from some characters having two or more Unicode representations, which can play havoc with searching and comparing strings. For example, there are two common UTF-8 encodings for the letter e with an acute accent (é): Form C consists of C3 A9, while Form D is 65 CC 81. Thus the common word café can be encoded in UTF-8 using either

  • 63 61 66 C3 A9 (Form C), or
  • 63 61 66 65 CC 81 (Form D).

When you’re searching a text document for that word, wouldn’t you want to find both forms? This becomes even more critical in file systems. Should they treat those two names as equivalents, or allow you to name two files in the same folder using the two different forms?

One common approach is to normalise names to Form D, as HFS+ does; APFS instead preserves the name as given, but compares names in a normalisation-insensitive way. Other file systems may normalise to Form C, and some perform no normalisation at all. The results can be disastrous, as there’s still no consensus on the best approach, and it’s almost unheard of to declare which normalisation, if any, has been applied to a Unicode string.
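
Foundation exposes both forms directly on Swift strings, as precomposedStringWithCanonicalMapping (Form C) and decomposedStringWithCanonicalMapping (Form D), and Swift compares strings by canonical equivalence, so the two forms still match even though their bytes differ:

  import Foundation

  let cafe = "caf\u{00E9}"                           // é as a single code point

  let formC = cafe.precomposedStringWithCanonicalMapping
  let formD = cafe.decomposedStringWithCanonicalMapping

  print(Array(formC.utf8).map { String(format: "%02X", $0) }.joined(separator: " "))
  // 63 61 66 C3 A9
  print(Array(formD.utf8).map { String(format: "%02X", $0) }.joined(separator: " "))
  // 63 61 66 65 CC 81

  // Swift's == treats canonically equivalent strings as equal.
  print(formC == formD)                              // true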

As I wrote at the start, no computer text is both plain and simple, and most turns out to be far more complex than you envisaged.