Text, strings and Unicode

Plain text might seem simple, but no computer text is both plain and simple. Before the adoption of Unicode, there was a great deal more to text encoding than ASCII, and Unicode hasn’t turned out to be any simpler. Here I’ll refer not to text, but to the strings of characters from which non-styled text is composed.

Before Unicode

Original ASCII was only 7-bit, to reduce the cost of transmission, as that was deemed sufficient to encode all the characters needed for the unaccented Roman/Latin alphabet. Those 128 characters, including a NUL of 00, were standardised in 1963, and were soon being specialised for different alphabets and extended for particular purposes. Among those extensions is Mac OS Roman, a full 8-bit set of 256 characters that set the standard for Classic Mac OS.

At the same time that ASCII was being standardised, IBM was devising EBCDIC, its own 8-bit encoding, which became popular on its mainframes and those of other manufacturers. Fortunately, few of those using early Macs came into contact with it, or with its incompatibilities with ASCII.

In 1987 ISO standard 8859 was published in a series of parts, defining 8-bit character sets based on the ASCII original, catering for languages including ISO Latin 1 (in part 1, or ISO 8859-1), Latin/Cyrillic (8859-5), and others. Microsoft diverged to its own standards for Windows, in what’s generally known as Windows-1252 or Code Page 1252, over the period 1985-98.

Those encoding standards, and the many variants known as code pages, only determine how individual characters are encoded, not how strings of characters are formed. Classic Macs were predominantly programmed in Apple’s Object Pascal language, which uses P-strings. The first byte(s) of a P-string give its length in bytes. At first a single length byte was used, limiting strings to a maximum of 255 characters, but that proved too restrictive and was later increased.

The native string format for the C language works differently, and stores no length. Instead, C-strings are terminated with a null byte, 00. That brought its own problems, in particular that a C-string can’t contain a null character without being truncated at that point. As you can imagine, accessing a C-string as if it were a P-string, or the other way round, ensured an ample supply of bugs.

Plain Roman or Latin strings were fairly simple, though. The word Mac can be represented as the following hexadecimal bytes:

  • 4D 61 63 in ASCII and all related standards for Roman/Latin character sets
  • 4D 61 63 00 as a C-string
  • 03 4D 61 63 as a P-string.
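
To make the difference concrete, here’s a minimal Swift sketch that builds both layouts for the word Mac and reads them back; the names are mine, and neither helper reflects any historical Mac OS API.

  import Foundation

  // Byte layouts for the word "Mac"; illustrative only.
  let ascii: [UInt8] = Array("Mac".utf8)               // 4D 61 63
  let cString: [UInt8] = ascii + [0x00]                // 4D 61 63 00, NUL-terminated
  let pString: [UInt8] = [UInt8(ascii.count)] + ascii  // 03 4D 61 63, length byte first

  // Reading each back requires knowing which convention applies.
  let fromC = String(bytes: cString.prefix(while: { $0 != 0 }), encoding: .ascii)
  let fromP = String(bytes: pString.dropFirst().prefix(Int(pString[0])), encoding: .ascii)
  print(fromC ?? "", fromP ?? "")                      // Mac Mac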

You may still encounter these in Mac OS Roman, or macOSRoman, usually with the UTI of com.apple.traditional-mac-plain-text. Things got more complex when working with non-Roman alphabets, for which you may come across isoLatin2 and several standards for Japanese including iso2022JP, japaneseEUC and shiftJIS.
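
Swift still exposes these legacy encodings through String.Encoding, which makes it easy to see how the same bytes can decode quite differently; in this brief sketch, the trailing byte 0xA5 is chosen purely for illustration.

  import Foundation

  // 0xA5 is a bullet (•) in Mac OS Roman, but a yen sign (¥) in ISO Latin 1.
  let bytes = Data([0x4D, 0x61, 0x63, 0xA5])

  let asMacRoman = String(data: bytes, encoding: .macOSRoman)  // "Mac•"
  let asLatin1 = String(data: bytes, encoding: .isoLatin1)     // "Mac¥"
  let asShiftJIS = String(data: bytes, encoding: .shiftJIS)    // may decode differently, or fail
  print(asMacRoman ?? "nil", asLatin1 ?? "nil", asShiftJIS ?? "nil")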

Unicode

Proliferation of all these different character sets and code pages was clearly untenable, so a replacement universal encoding system, Unicode, was initiated in 1991 with the publication of the first volume of its standard. This cunningly mirrored ISO Latin 1 (8859-1) in its first 256 code points, to minimise conversion of most text in Europe and North America. Since that first version, with just over 7,000 characters covering 24 scripts, Unicode has grown to encompass nearly 160,000 characters in 172 scripts.

The internal structure of Unicode is complex, with code points arranged in 17 planes and many blocks. The most commonly used characters can be encoded in any of three data types:

  • one to three bytes of UTF-8
  • two bytes in UTF-16
  • four bytes in UTF-32.

The most infrequently used code points (and characters) require a maximum of four bytes, whichever of UTF-8, -16 or -32 is used. Thus, for typical text dominated by Roman/Latin characters, UTF-8 is the most space-efficient, with UTF-16 less so. Taking the word Mac as the example again, it can be represented using the following hexadecimal bytes:

  • 4D 61 63 in UTF-8, exactly the same as ASCII and its relatives
  • 00 4D 00 61 00 63 in UTF-16
  • 00 00 00 4D 00 00 00 61 00 00 00 63 in UTF-32.
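
Those byte sequences are easy to confirm in Swift. This minimal sketch uses the explicitly big-endian encodings so that its output matches the order shown above; the hex function is just for display.

  import Foundation

  func hex(_ data: Data) -> String {
      data.map { String(format: "%02X", $0) }.joined(separator: " ")
  }

  let mac = "Mac"
  print(hex(Data(mac.utf8)))                     // 4D 61 63
  print(hex(mac.data(using: .utf16BigEndian)!))  // 00 4D 00 61 00 63
  print(hex(mac.data(using: .utf32BigEndian)!))  // 00 00 00 4D 00 00 00 61 00 00 00 63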

Endianness and BOMs

Different processor architectures store the bytes within words in different orders, known as big-endian and little-endian. This affects the order of the bytes in UTF-16 and UTF-32, but not UTF-8, so each of those has two variants: UTF-16BE and UTF-16LE, and UTF-32BE and UTF-32LE. When there’s any doubt over which byte order is in use, it can be indicated with a byte-order mark (BOM), the code point U+FEFF placed at the start of the string, whose FE and FF bytes reveal the order. For UTF-16, those would be

  • FE FF 00 4D 00 61 00 63 for big-endian, or
  • FF FE 4D 00 61 00 63 00 for little-endian.

Although using a BOM is optional, it can prevent misinterpretation when a string is likely to be exchanged between platforms of different endianness. Read with the wrong byte order, a string turns into the nonsense characters known as mojibake: Roman text, for example, can come out as apparent Japanese gibberish.

Apple’s developer documentation wraps these into a list of encoding options available for Swift Strings.
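
Among those options, the plain .utf16 encoding writes a BOM in the host’s byte order when encoding, while the explicitly BE and LE encodings write none, as far as I can tell. A short sketch, reusing the same hex helper:

  import Foundation

  func hex(_ data: Data) -> String {
      data.map { String(format: "%02X", $0) }.joined(separator: " ")
  }

  let mac = "Mac"
  // On today's little-endian Macs this should start FF FE; a big-endian host would give FE FF.
  print(hex(mac.data(using: .utf16)!))              // FF FE 4D 00 61 00 63 00
  print(hex(mac.data(using: .utf16BigEndian)!))     // 00 4D 00 61 00 63, no BOM
  print(hex(mac.data(using: .utf16LittleEndian)!))  // 4D 00 61 00 63 00, no BOM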

Normalisation

There’s still one more encoding issue with Unicode: normalisation forms. These result from some characters having two or more Unicode representations, which can play havoc with searching and comparing strings. For example, there are two common UTF-8 encodings for the letter e with an acute accent (é): Form C consists of C3 A9, while Form D is 65 CC 81. Thus the common word café can be encoded in UTF-8 using either

  • 63 61 66 C3 A9 (Form C), or
  • 63 61 66 65 CC 81 (Form D).

When you’re searching a text document for that word, wouldn’t you want to find both forms? This becomes even more critical in file systems. Should they treat those two names as equivalents, or allow you to name two files in the same folder using the two different forms?

One common approach is to normalise names to Form D, as HFS+ does; APFS instead preserves the name as given, but compares names in a normalisation-insensitive way. Other file systems may normalise to Form C, and some perform no normalisation at all. The results can be disastrous, as there’s still no consensus on the best approach, and it’s almost unheard of to declare which normalisation, if any, has been applied to a Unicode string.
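
Foundation exposes both forms directly on Swift strings, as precomposedStringWithCanonicalMapping (Form C) and decomposedStringWithCanonicalMapping (Form D), and Swift compares strings by canonical equivalence, so the two forms still match even though their bytes differ:

  import Foundation

  let cafe = "caf\u{00E9}"                           // é as a single code point

  let formC = cafe.precomposedStringWithCanonicalMapping
  let formD = cafe.decomposedStringWithCanonicalMapping

  print(Array(formC.utf8).map { String(format: "%02X", $0) }.joined(separator: " "))
  // 63 61 66 C3 A9
  print(Array(formD.utf8).map { String(format: "%02X", $0) }.joined(separator: " "))
  // 63 61 66 65 CC 81

  // Swift's == treats canonically equivalent strings as equal.
  print(formC == formD)                              // true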

As I wrote at the start, no computer text is both plain and simple, and most turns out to be far more complex than you envisaged.