In my recent account of Morse code, I remarked on how efficient the code is, even though it was devised a century before information theory was invented.
This article revisits the task which Morse and his team tackled, to see whether it is possible to devise an even more efficient code.
For the sake of this little study, I will try to design an encoding similar to that of Morse code, consisting of only dots and dashes (which I will also represent as 0 and 1 bits respectively), for 26 letters, 10 numerals, and 4 punctuation marks (‘.’, ‘,’, ‘?’, and parentheses) in standard English; these do not include a character representing a space, which I will assume is conveyed by timing (see below). The aim is to encode text in the most efficient manner for storage or transmission, in terms of compactness.
Importance of character frequency
English letters do not occur with random frequencies, but if their occurrence were completely random, there would be no scope for improvement over a simple enumeration of the characters. Thus A could be represented by 0, B by 1, and so on up to 39. The number of bits required to encode each character would therefore be between 5 and 6 (log₂ 40 ≈ 5.322 bits, so 6 bits in a practical fixed-length code).
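This baseline is easy to check; a minimal sketch in Python, where the alphabet size is the only input:

```python
import math

# 26 letters + 10 numerals + 4 punctuation marks
alphabet_size = 26 + 10 + 4

# A simple enumeration needs the same number of bits for every
# character: the next whole number above log2 of the alphabet size.
theoretical = math.log2(alphabet_size)   # about 5.322 bits
fixed_length = math.ceil(theoretical)    # 6 bits in practice

print(f"log2({alphabet_size}) = {theoretical:.3f} bits")
print(f"fixed-length code: {fixed_length} bits per character")
```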
At the other extreme, if English consisted almost entirely of a single character such as A, with infrequent random occurrence of other characters, we could encode A using the single bit 1, and the other characters using 0 as the first bit, followed by up to 5 further bits to specify which other character. The average number of bits required to encode such bizarre text would therefore tend towards 1, but always exceed 1 because of the longer encodings required for characters other than A.
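The limit can be made concrete with a sketch using my own illustrative cost model: 1 bit for A, and 6 bits (the escape bit plus up to 5 more) for anything else:

```python
def average_bits(p_a: float) -> float:
    """Expected bits per character when A has probability p_a."""
    # A costs 1 bit; every other character costs 1 + 5 = 6 bits.
    return p_a * 1 + (1 - p_a) * 6

# As A dominates, the average tends towards 1 but never reaches it.
for p in (0.9, 0.99, 0.999):
    print(f"P(A) = {p}: {average_bits(p):.3f} bits per character")
```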
If we use codes constructed so that the most frequently occurring characters in English are assigned the shortest codes, then we should create the most compact encoding, just as Morse and his team attempted. This is the technique of variable-length encoding, known as Huffman coding after David Huffman, who developed its first theoretical and practical exposition.
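A minimal sketch of Huffman's construction, using a toy five-letter frequency table of my own rather than real English figures: the two lightest subtrees are repeatedly merged, so the rarest symbols end up deepest in the tree with the longest codes.

```python
import heapq

def huffman_code(freqs: dict[str, float]) -> dict[str, str]:
    """Build a prefix-free code: frequent symbols get the shortest bit strings."""
    # Each heap entry is (weight, tiebreak, {symbol: code-so-far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix 0 onto one subtree's codes and 1 onto the other's.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Toy frequencies, illustrative only
freqs = {"E": 0.40, "T": 0.25, "A": 0.20, "Q": 0.10, "Z": 0.05}
code = huffman_code(freqs)
average = sum(freqs[s] * len(code[s]) for s in freqs)
print(code)
print(f"average length: {average:.2f} bits per character")
```

Here the dominant E receives a 1-bit code and the rare Q and Z receive 4-bit codes, giving an average of 2.1 bits per character against the 3 bits a fixed-length code for five symbols would need.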
There is a slight cheat here: I am going to work along the same lines as Morse code, which inserts a space between each character transmitted, and a longer space between each word. This is fine for a carefully timed stream of signals, as is produced by an experienced Morse operator.
In other circumstances, such as the storage of data in binary form in a computer, the boundary between characters has to be marked, for example using a ‘stop bit’ or a special character such as a series of zeros. These impose additional overhead, which I here ignore.
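Part of the reason Morse needs those gaps is that its codes are not prefix-free: E (a single dot) is a prefix of I (two dots). A prefix-free code, such as a Huffman code, can be decoded from an unbroken bit stream with no separators at all; a sketch, with hypothetical codewords of my own:

```python
# Hypothetical prefix-free codewords (no codeword is a prefix of another)
code = {"E": "1", "T": "01", "A": "001", "N": "000"}
decode_table = {bits: ch for ch, bits in code.items()}

def decode_stream(bits: str) -> str:
    """Greedy decoding: works only because the code is prefix-free."""
    out, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in decode_table:   # a complete codeword has been read
            out.append(decode_table[buffer])
            buffer = ""
    return "".join(out)

# "NEAT" encoded as 000 1 001 01, sent with no boundary markers
print(decode_stream("000100101"))  # prints NEAT
```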
Estimating character frequency
The classical analysis of letter frequency in English text was carried out by Mayzner on 20,000 words of English, but since then Peter Norvig has analysed more than 3.5 trillion letters in Google’s books. The most frequent letters, in decreasing order, begin E, T, A, O, I, N.
To those we need to add numerals and punctuation marks, which are usually assumed to be among the least frequent characters.
Depending on the text to be encoded, actual frequencies could of course be quite different: use of the ‘Q’ code in radio makes that letter considerably more common than in normal written English (and without a following U), and the presence of numerals in callsigns would also be significant. However, for the sake of this study, I will ignore those effects and use the encoding tree shown below, based on letter frequencies.
I have deliberately kept as many letter codes as possible the same as in traditional Morse, so that the changes are no larger than necessary for improved efficiency, as shown in the table below. Note that, as I have not considered frequencies of numerals or punctuation marks, the codes for those are outline ideas rather than joined-up thoughts.
There is another obvious way in which we might be able to improve the efficiency of the coding of English text: by coding multiple letters, words, or blocks of text in one go.
Simple run length encoding, as used widely in image compression, is of no use in English, where very few sequences of the same character occur. However analysis of letter groups, words, and even phrases could allow them to be encoded.
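The failure of run-length encoding on English is easy to demonstrate; a sketch (long runs of one character compress well, as in images with large uniform areas, but English runs are almost all of length 1):

```python
from itertools import groupby

def rle(text: str) -> list[tuple[str, int]]:
    """Run-length encode: collapse each run of repeats into (char, count)."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

print(rle("AAAABBB"))       # two runs: good compression
print(rle("THEQUICKFOX"))   # every run has length 1: no saving at all
```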
Norvig found that the most common words in the English language,
THE OF AND TO IN A IS THAT
and so on, in decreasing order of frequency, account for slightly over 20% of words found, and there could be some benefit to coding some of those as ‘prosigns’ with 5 or even 6 bits (apart from the word
A, of course!). However, because those words already use the most frequent letters, there seems little to be gained: for example, coded letter-by-letter,
THE uses a total of 5 bits, the same length as a prosign.
James Stone (2015) considers in detail the savings which could be achieved by encoding progressively longer sequences of characters, and concludes that they are optimal with 10-character sequences, which would require only 1.8 bits per character. However, such a code would require a human to memorise over a quarter of a million different codes if it were to replace Morse code, so it could only be used by computers.
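The reason longer blocks help is Shannon's entropy: no symbol-by-symbol code can average fewer bits than the entropy of the symbol distribution, while coding long blocks exploits the dependencies between neighbouring characters and so approaches the much lower entropy rate of English. A sketch of the entropy calculation, using a toy distribution of my own:

```python
import math

def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy in bits: the floor for any symbol-by-symbol code."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total) for f in freqs.values() if f)

# Toy counts, illustrative only: no symbol-by-symbol code for these
# four symbols can average fewer bits than this.
print(f"{entropy({'E': 4, 'T': 2, 'A': 1, 'O': 1}):.2f} bits per symbol")
```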
Using this new Morse-like encoding would, on the Google books data and ignoring numerals and punctuation, result in an average of 2.39 bits per character: significantly better than the 2.53 bits per character of standard International Morse Code, although still poorer than Stone’s optimum of 1.8.
For comparison, encoding using ASCII requires 8 bits per character, although it does of course provide a much larger character set. An operator normally capable of keying Morse at an average rate of 150 characters per minute would struggle to manage 50 per minute using a Morse-like version of ASCII, with each character eight dots and dashes long. But using this ‘improved’ version of Morse, as shown here, should allow them to key at an average rate of almost 160 characters per minute: a small but worthwhile gain in efficiency. Of course, that gain could vanish when applied to different message content.
It is remarkable that Vail, Morse and Henry achieved such amazing efficiency a hundred years before the start of the science of information and communication theory.
Stone JV (2015) Information Theory: A Tutorial Introduction, Sebtel Press. ISBN 978 0 956 37285 7. Particularly Chapter 3.