Last Week on my Mac: Creaky old internet

I seem to have spent, perhaps wasted, an inordinate amount of time this week wrestling with problems that have a common cause: the internet’s failure to progress.

Since I first became connected at 9.6 kilobits per second using a dial-up modem, to today when many enjoy hundreds of megabits per second over constant connections, internet standards have faithfully stuck with the ancient and flawed. Much of the standardisation on which today’s internet is based is now more than twenty years old.

The issue which has taken so much of my time on this occasion has been Uniform Resource Locators, URLs, which most of us see as ‘web addresses’, although they are not quite the same. They are the universal way in which we, our Macs, and other devices, refer to remote resources such as web pages. In this context, they consist of three parts:

  1. a protocol, here http or https,
  2. a hostname, such as www.apple.com,
  3. a file name, such as index.html.
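Those three parts can be pulled apart with almost any URL library. As a minimal sketch, assuming Python and its standard urllib.parse module (my choice for illustration, nothing specific to the Mac), the example URL splits like this:

```python
from urllib.parse import urlsplit

# Split the example URL into the three parts listed above.
parts = urlsplit("http://www.apple.com/index.html")

print(parts.scheme)    # the protocol
print(parts.hostname)  # the hostname
print(parts.path)      # the file name, with its leading slash
```

Note that the library reports the file name as a path, complete with its leading slash.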

Having originated largely within English-speaking nations, the original system of URLs was based entirely on the English (‘Roman’) alphabet, together with a range of punctuation and special characters. This was fine for URLs like
http://www.apple.com/index.html
but couldn’t even cope with the most common accented characters in Roman-based languages used worldwide, such as Spanish, French, and German, let alone the likes of Chinese or Japanese.

In 1991, the advent of Unicode started to change that, but URLs remained steadfastly tied to their pure English character set, known by the euphemism of ASCII. I refer to this as a euphemism, because the abbreviation includes the words “Information Interchange”, and any code which excludes the languages of most of the world’s population is hardly fit for that purpose: it cannot even encode the native language (Spanish) of more than forty million residents of the USA.

The rest of the world has moved on, thanks to Unicode, which has eventually pervaded the encoding of text throughout Apple’s operating systems, Windows, and almost every active document standard apart from Adobe PDF.

But URLs remain a combination of two incompatible fudges which still do their utmost to pretend that Unicode doesn’t really exist.

Hostnames can now use Unicode characters, such as
www.møller.dk
but as the name service DNS still doesn’t handle Unicode, they have to be encoded using a kludge named (entirely appropriately) Punycode. This uses good old ASCII to encode Unicode characters in an entirely perverse way. For example, that hostname would be encoded as
www.xn--mller-vua.dk
I leave it as an exercise for readers to deduce how this encoding works, or you can cheat and convert your own hostnames using an online Punycode converter.
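If you would rather cheat locally, here is a small sketch assuming Python, whose standard library happens to ship "punycode" and "idna" codecs. The built-in idna codec follows the older IDNA 2003 rules, so treat it as an illustration rather than what any particular browser or registry does; for this hostname the result matches the one above.

```python
hostname = "www.møller.dk"

# Punycode on a single label: the ASCII letters come first,
# then a hyphen, then the encoded position and value of 'ø'.
label = "møller".encode("punycode").decode("ascii")

# The idna codec handles each dot-separated label in turn and adds
# the "xn--" prefix to any label that needed Punycode.
encoded = hostname.encode("idna").decode("ascii")

# The conversion is reversible.
decoded = encoded.encode("ascii").decode("idna")

print(label)
print(encoded)
print(decoded)
```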

File names are encoded using quite a different system, which is more transparent, but verbose. Characters which are outside the basic ASCII set are first converted to their Unicode UTF-8 encoding, and each resulting byte is then written as the escape character % followed by two hexadecimal digits. So the Unicode file name of
rugbrød-med-gær.html
would be encoded as
rugbr%C3%B8d-med-g%C3%A6r.html
making my entire hypothetical URL
http://www.xn--mller-vua.dk/rugbr%C3%B8d-med-g%C3%A6r.html
instead of the Unicode
http://www.møller.dk/rugbrød-med-gær.html
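The file name half of this is easy to reproduce. The following is a sketch assuming Python's standard urllib.parse module, which percent-encodes using UTF-8 by default; the variable names are mine.

```python
from urllib.parse import quote, unquote

name = "rugbrød-med-gær.html"

# 'ø' occupies two bytes in UTF-8, so it becomes two %XX escapes.
utf8_bytes = "ø".encode("utf-8")

# quote() leaves ASCII letters, digits, '-' and '.' untouched and
# escapes every byte of each non-ASCII character.
escaped = quote(name)

# unquote() reverses the process, again assuming UTF-8.
restored = unquote(escaped)

print(utf8_bytes.hex())
print(escaped)
print(restored)
```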

Amazingly, one of the claimed justifications for continuing with this is to prevent users from being fooled into following links to malicious sites which substitute visually similar Unicode characters, as in appĺę.com, which is surely a problem for domain registrars to address. This is particularly true now, as all browsers allow you to enter full Unicode URLs, known more correctly as Internationalised Resource Identifiers (IRI), and automatically convert them into real URLs which the internet can cope with.

There’s the rub: we now expect our apps to accept IRIs, which we think are URLs, and leave it to the app to handle the conversion. Like all such assumed conversions, it doesn’t always happen, and then it results in problems which can be amazingly opaque. I have already explained the problem that I walked into, and eventually solved, which has wasted many hours of my time this week.
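To make clear what that conversion involves, here is a rough sketch in Python of the two steps an app has to perform on an IRI. The function name iri_to_url is my own invention, and the sketch makes simplifying assumptions: the IRI always has a hostname, any username or password is dropped, the query string is passed through untouched, and Python's built-in idna codec (IDNA 2003) stands in for whatever a real browser uses.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_url(iri: str) -> str:
    """Hypothetical helper: convert a Unicode IRI into an ASCII URL."""
    parts = urlsplit(iri)

    # Step 1: the hostname goes through Punycode (via the idna codec).
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    # Step 2: the path is percent-encoded as UTF-8. Keeping '%' safe
    # avoids escaping a path that has already been encoded.
    path = quote(parts.path, safe="/%")

    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

url = iri_to_url("http://www.møller.dk/rugbrød-med-gær.html")
print(url)
```

When an app skips either step, or applies the wrong scheme to the wrong part, what reaches the network is no longer a valid URL, and that is where the opaque failures come from.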

Those who determine the standards used by the internet are well aware of the problem, and many other issues resulting from this mess. The solution is really simple: switch DNS and the rest of URL resolution to use Unicode text. But that is unconscionable, because it would not be backward-compatible. So we struggle on for further years, kludging through, just so that what we use today is backward-compatible with the myopic view of the internet which prevailed in the 1980s.

It’s almost as bad as making all processors run in 16-bit mode even though they are perfectly capable of running in 64-bit mode, just so that you can still run 16-bit software.

In twenty years’ time, is the internet still not going to have switched from ASCII to Unicode? Surely it must some day, so let’s name a day soon, when the internet finally catches up with the twenty-first century, and its use by the rest of the world outside the US.