February 07, 2013

The Challenges of Character Encoding... in 2013

This article is going to be a bit of a rant.

Back when I was a lad, Unicode wasn't as popular as it is now. The world lived in the Dark Ages, where each text character was usually represented by a single byte, regardless of the language. A single byte can take 256 different values, meaning that, on its own, it can represent at most 256 different characters. People wanted to communicate with each other in different languages, but the number of characters in all modern languages was far more than 256. This posed a problem.

A commonly accepted solution was to agree that the first 128 character codes were fixed (the ASCII character set), while the last 128 were unspecified -- their meaning depended on the context. A swarm of character encodings emerged, each handling these last 128 codes differently. For my native Russian, there were at least two such encodings: KOI8-R and CP1251. When you got text in Russian, you typically had to guess which encoding it was in. If you guessed wrong, all you saw was gibberish (plus any English text, since that was handled the same regardless of the encoding). It was 50/50, so not too bad. There were also some utilities that helped you guess the encoding.
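To make the guessing game concrete, here is a minimal Python sketch. It assumes the sender used KOI8-R and relies only on the codecs that ship with Python; the sample text is made up for illustration.

```python
# Illustration only: one byte string, read under two Cyrillic code pages.
text = "Привет, world!"

# The bytes as a KOI8-R system would have stored them (one byte per character).
raw = text.encode("koi8_r")

# Guessing the right encoding restores the original text.
guessed_right = raw.decode("koi8_r")

# Guessing CP1251 raises no error, because every one of these bytes maps to
# some CP1251 character. The Cyrillic letters just come out as different ones.
guessed_wrong = raw.decode("cp1251")

# The ASCII part is identical either way, because both code pages agree on
# the first 128 codes.
ascii_survives = guessed_wrong.endswith(", world!")
```

With the wrong guess, the Russian word turns into a string of unrelated Cyrillic letters while ", world!" stays perfectly readable, which is exactly the gibberish-plus-English effect described above.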

A significant limitation, of course, was that you couldn't freely mix languages in text. For example, working with Russian and Greek text in the same file was not possible. Greek used a different encoding, which was compatible with neither KOI8-R nor CP1251. You could see Russian or Greek at any one time, but never both.
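The limitation is easy to reproduce. The sketch below is an illustration under one assumption: it uses ISO 8859-7 as the Greek code page, since the text doesn't say which Greek encoding was involved. The helper `can_encode` is a made-up name.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


russian = "Привет"
greek = "Γειά"
mixed = russian + " " + greek

# Each language fits its own code page...
russian_ok = can_encode(russian, "cp1251")
greek_ok = can_encode(greek, "iso8859_7")  # assumed Greek code page

# ...but none of these single-byte code pages holds both alphabets.
mixed_fits = {c: can_encode(mixed, c) for c in ("koi8_r", "cp1251", "iso8859_7")}

# A Unicode encoding such as UTF-8 has no such limit.
utf8_ok = can_encode(mixed, "utf-8")
```

Every entry in `mixed_fits` comes out `False`, while `utf8_ok` is `True`.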

Fast-forward to 2013. Unicode has become almost universally accepted. It solved the problems above and introduced several new ones but, overall, it made the world a better place. The days of staring at gibberish, trying to work out which encoding it was in, were gone. Things just worked. Or so I thought.

I recently had the pleasure of playing correspondence chess with my dad using a chess server. While the game itself was quite entertaining (I lost miserably), I quickly noticed that sending Unicode messages wasn't working -- ASCII characters got through, while each non-ASCII character was replaced by a question mark. This forced us to write in transliterated Russian, which I hate with a passion.


What was obvious was that, while the Web site was capable of displaying Cyrillic (which I managed to enter using HTML character escapes, thanks to the site admin), it was ruthlessly clobbering the text after it was submitted. Looking at it more closely, I realized the input textarea was part of a POST form, and clicking the "Msg" button submitted the form. The message was, therefore, part of the POST payload. Using Chrome's Developer Tools, I confirmed that it was successfully URL-encoded and still readable at that point. After that, all trace of the message is lost, but one thing is clear: the non-ASCII characters in the message died a horrible death.
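For reference, this is roughly what a healthy URL-encoded payload looks like, sketched in Python rather than captured from the site. The field name `msg` and the message text are invented for the example; the real form's field names are not known.

```python
from urllib.parse import parse_qs, urlencode

# "msg" is a hypothetical field name, used here only for illustration.
message = "Привет, папа!"

# urlencode() percent-encodes the UTF-8 bytes of the value by default.
payload = urlencode({"msg": message})

# The payload is pure ASCII, so it can travel through any transport intact...
payload_is_ascii = payload.isascii()

# ...and decoding it as UTF-8 gives the original message back.
recovered = parse_qs(payload)["msg"][0]

print(payload)
```

Each Cyrillic letter becomes two percent-escaped bytes (for example, `%D0%9F` for "П"), and nothing is lost as long as the receiving side decodes those bytes as UTF-8.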



While encountering a fairly popular site that could not properly handle Unicode in 2013 was amusing, what was even more amusing was my dialogue with the administrator of the site. While I do believe they were genuinely trying to help, their firm belief that a server-side encoding problem could be fixed by modifying the client configuration was disconcerting, to say the least. The more I talked to them, the more I understood that they had no idea how character encoding works. Unfortunately, they also had little desire to reason about it, which led to a rather tense dialogue:

"... the language that you use to handle message input is controlled by your local configuration and this is something over which we have no control. All of our pages, in common with around 75% of all online content from many diverse sites, are configured to accept input that conforms with UTF-8 standards. As I have tried to point out, there may well be a local issue relevant to your local configuration, codepage setting, default character coding and so on. These are not issues for which we, as a UK based Chess playing site, normally provide support. In this case, we have offered more support and advice than would have been provided by many of our competitors."

I edited the quote above liberally, for brevity. It can, however, be summarized by a single word in the King's English: bollocks. What's really happening is this: after the form is submitted, the URL-encoded message is retrieved from the POST payload and encoded into one of the antique single-byte character encodings from the Dark Ages, presumably a Western European one. Since such an encoding only represents characters with codes up to 255, and the Unicode Cyrillic characters sit around the 1000 mark (the block starts at U+0400, or 1024), there's simply no way to represent those characters anymore. Whatever performs the encoding replaces each such character with a placeholder, which in this case happens to be a question mark.
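The effect can be reproduced in a few lines of Python. This is a sketch of the failure mode, not the site's actual code: the function name `clobber` is made up, and Latin-1 is only an assumed stand-in for whatever legacy encoding the server uses.

```python
def clobber(message: str, legacy_codec: str = "latin-1") -> str:
    """Round-trip `message` through a single-byte codec, the way a
    misconfigured server might. Characters the codec lacks become '?'."""
    # errors="replace" substitutes '?' for anything that cannot be encoded.
    return message.encode(legacy_codec, errors="replace").decode(legacy_codec)


# Cyrillic code points lie far beyond what a single byte can hold.
cyrillic_code_point = ord("П")  # 1055, i.e. U+041F

print(clobber("Привет, dad!"))  # ??????, dad!
```

ASCII passes through untouched and every Cyrillic letter collapses into a question mark, which matches what arrived on the other side of the chess server. No client-side setting can undo this, because the information is already gone by the time the text is stored.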


I mashed up some JavaScript to demonstrate the problem. I couldn't be bothered doing a full form POST, but the code captures the essence of it. It's below.

Here's a live version you can interact with. Case closed.



Had I been a more patient man, I would have persevered against the onslaught of ignorance and carried on my crusade for working Unicode input to its glorious end. Instead, I just opened an account on another chess server.