19 Nov 2012, 00:27

Chinese tweets and the basic idea behind character encoding

Tonight my wife and I were on Reddit and we found a TED Talk by Michael Anti called “Behind the Great Firewall of China.” At one point during his talk, Anti mentioned that the information density of written Mandarin is much higher than that of English, and demonstrated with the following image:

Tweet comparison of Mandarin and English

The idea, of course, is that in just 140 characters you can get a lot of meaning across in Chinese, especially compared to what you can convey in a similar space using English. Interesting enough. I pointed out, though, that while the written information density of Mandarin is quite high compared to English, it’s likely a wash when you consider the digital information density, since you can express almost anything in English using 7 bits per character in ASCII, while a Chinese character typically takes 16 to 24 bits, depending on the encoding. My wife’s not a programmer, so I tried to explain what I meant, and did a very poor job of it. She gave me a couple of tries, but I only succeeded in making a seemingly boring subject sound even more boring.

I’m going to give it another shot here, mostly as an exercise in ELI5.

In layman’s terms

Let’s pretend you want to write a letter to a friend. This friend has complained in the past that your handwriting is atrocious and she can barely read the notes you send her. You prefer to think of your handwriting as quirky and artistic, so this annoys you a little, but you know she’s right. So, you decide to type her letter on your old typewriter (the printer’s out of ink and the cartridges are too expensive anyway). You get the typewriter out, only to discover that the letter keys don’t work; you try to type out the salutation but nothing happens. However, the numbers still work. Being a stubborn sort of person and not wanting to give up, you devise a plan: you’ll type the letter out using an easy-to-guess numeric code: 1=A, 2=B, 3=C, etc. Punctuation and things like that are too complicated for this endeavor, so you’ll just have to get by with using the numbers 1 through 26 to communicate. Simple enough, and you’re confident that your friend will be able to decipher it.
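For the programmers reading along, here is a minimal Python sketch of that typewriter code. The names `LETTER_TO_NUMBER` and `encode_word` are made up for this illustration, and the sketch simply lowercases its input, since the code has numbers for only 26 letters.

```python
import string

# Hypothetical lookup table for the typewriter code: a=1, b=2, ..., z=26.
LETTER_TO_NUMBER = {
    letter: number
    for number, letter in enumerate(string.ascii_lowercase, start=1)
}

def encode_word(word):
    """Return the list of numbers that stand for the letters of `word`."""
    # The code has no notion of capital letters, so everything is lowercased.
    return [LETTER_TO_NUMBER[letter] for letter in word.lower()]

print(encode_word("Dear"))       # [4, 5, 1, 18]
print(encode_word("Guinevere"))  # [7, 21, 9, 14, 5, 22, 5, 18, 5]
```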

So you get to work typing. Right away you hit a snag: you want to start the letter with “Dear Guinevere”, but you realize you have no way of indicating the difference between capital and lowercase letters. This is annoying, but you decide to just live with it, although you think to yourself that a more perfect version of this code would include both lowercase and uppercase letters.

Anyway, you write out your letter using only numbers. In it you happen to mention that since the two of you spent that summer in China two years ago, your Mandarin has really started to slip. You’ve got some friends nearby who speak it and you practice with them sometimes, but you rarely have the chance to work on your writing skills. You finish the letter and look it over. This is when you realize your second big problem. Even though the spacebar works, the numbers within each word run together, so it’s not always obvious what you were trying to say. For example, “Dear Guinevere” translates to “45118 7219145225185”. That’s actually pretty complicated; if you’re decoding the first word, it could be 4-5-1-1-8, 4-5-11-8, or 4-5-1-18, which would decode to “deaah”, “dekh”, or “dear”, respectively. In this case, you can figure out which one is meant, but it takes some tedious guessing and checking. And consider how difficult it gets when decoding a long, uncommon name like Guinevere! You really want the code to be straightforward, since after all, you’re doing this to make communicating with your friend more efficient. So, you come up with a fix. Instead of using the pattern 1=a, 2=b, etc., you use 01=a, 02=b, 03=c, etc. This way, the doubt while decoding is eliminated. You know that every two digits, starting from the beginning of the word, represent one and only one letter. The word “dear” is now 04050118. A little longer, but perfectly clear as to where the letter breaks are. You quickly rewrite the letter using this new code, and mail it off.
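If you want to see the guessing game spelled out, here is a rough Python sketch. The function names are invented for this post. `all_decodings` tries every way of reading an unpadded number as letters 1 through 26, while `encode_padded` and `decode_padded` use the two-digit fix.

```python
import string

ALPHABET = string.ascii_lowercase  # letter number n is ALPHABET[n - 1]

def all_decodings(digits):
    """List every way to read an unpadded digit string as letters 1-26."""
    if not digits:
        return [""]
    results = []
    for width in (1, 2):  # a letter's number is one or two digits long
        chunk = digits[:width]
        is_full_chunk = len(chunk) == width and not chunk.startswith("0")
        if is_full_chunk and 1 <= int(chunk) <= 26:
            for rest in all_decodings(digits[width:]):
                results.append(ALPHABET[int(chunk) - 1] + rest)
    return results

def encode_padded(word):
    """Encode with exactly two digits per letter: a=01, b=02, ..., z=26."""
    return "".join(f"{ALPHABET.index(ch) + 1:02d}" for ch in word.lower())

def decode_padded(digits):
    """Read the digits two at a time; there is only one way to do it."""
    pairs = (digits[i:i + 2] for i in range(0, len(digits), 2))
    return "".join(ALPHABET[int(pair) - 1] for pair in pairs)

print(all_decodings("45118"))     # ['deaah', 'dear', 'dekh']
print(encode_padded("dear"))      # 04050118
print(decode_padded("04050118"))  # dear
```

The padded version costs more digits, but the decoder never has to guess where one letter ends and the next begins.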

The next week, you receive a reply from your friend. In spite of having to decode your letter from numbers, she still found it much easier to read than your typical chicken scratch, and insists that you send all of your letters in encoded form from now on. I guess your handwriting was really quite bad. Just go with it.

She also has an idea: in order to practice your written Chinese, the two of you should start writing your letters in Mandarin! This actually makes the encoding even more important, because even if your typewriter were working normally instead of being in “numbers only” mode, it doesn’t have Chinese characters. So, you can write out your letter in Chinese with a pen, and then encode it, using an encoding similar to the one before, except now each number corresponds to a different Chinese character. You’re also excited by the fact that, since Chinese characters are so information-dense, you can write a long, meaningful letter to your friend without having to encode as many characters.

However, once you start encoding the Chinese into the numbers code, you realize something. Adding the leading zeroes to each character is even more important now than it was in English. That’s because Chinese has thousands of different characters, so the numbers can get very large. For example, one character might be represented by the number “2173” because it’s the two thousand one hundred seventy-third character in your list. But if you’re decoding “2173”, it could be any of the following:

2-1-7-3

2-1-73

2-173

2173

21-7-3

21-73

217-3

2-17-3

So, you really need to use your leading zeroes in this case to avoid confusion. With four digits per character, a “2173” that you actually meant as 21-7-3 gets encoded as 002100070003. This drastically reduces the space savings you thought you’d get by writing in Chinese. It’s annoying but completely necessary to write “0003” every time you need to write “3”; otherwise the numbers would all run together and be too confusing to decode. But, you’re getting to practice your written Chinese and stay in touch with your friend, so you get over it and mail the letter.
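Here is the same idea as a small Python sketch, again with made-up function names. `all_splits` lists every way the digits of an unpadded number could be cut up, and `encode_fixed` and `decode_fixed` use a fixed width of four digits per character. The width of four is an assumption that your character list has fewer than 10,000 entries.

```python
from itertools import product

def all_splits(digits):
    """Every way to cut a non-empty digit string into consecutive pieces."""
    splits = []
    # Each gap between two neighboring digits is either a cut or not a cut.
    for cuts in product([False, True], repeat=len(digits) - 1):
        pieces, current = [], digits[0]
        for digit, cut in zip(digits[1:], cuts):
            if cut:
                pieces.append(current)
                current = digit
            else:
                current += digit
        pieces.append(current)
        splits.append("-".join(pieces))
    return splits

def encode_fixed(numbers, width=4):
    """Pad every character number with leading zeroes to the same width."""
    return "".join(f"{number:0{width}d}" for number in numbers)

def decode_fixed(digits, width=4):
    """Read the digits `width` at a time, one character number per chunk."""
    return [int(digits[i:i + width]) for i in range(0, len(digits), width)]

print(len(all_splits("2173")))       # 8
print(encode_fixed([21, 7, 3]))      # 002100070003
print(decode_fixed("002100070003"))  # [21, 7, 3]
```

Three gaps between four digits give 2 × 2 × 2 = 8 possible readings, which is why the unpadded version is hopeless without some agreed-upon width.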

This is at least a little bit similar to how computers send text to each other. They have to encode it as numbers, because that’s the language they speak. In order to be able to read the streams of numbers in the right context, they have to have an agreed-upon length for each number, say, 8 binary digits (bits). Then they know that each 8-digit number represents a character that they can then display on a screen. So, there you have it; that’s an extremely oversimplified but hopefully helpful introduction to character encoding.
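To connect the analogy back to real computers, here is a short Python sketch. The sample strings are just illustrations I picked. It uses ASCII for the English word and UTF-8 for the Chinese one; UTF-8 is one common real-world encoding, and unlike the typewriter code it uses a different number of 8-bit chunks for different characters.

```python
english = "dear"
chinese = "你好"  # "hello" in Mandarin, chosen only as an example

# Every character has a number, just like in the typewriter code.
print([ord(ch) for ch in english])  # [100, 101, 97, 114]
print([ord(ch) for ch in chinese])  # [20320, 22909]

# ASCII text stored as fixed-width chunks of 8 binary digits each.
ascii_bytes = english.encode("ascii")
bits = " ".join(format(byte, "08b") for byte in ascii_bytes)
print(bits)  # 01100100 01100101 01100001 01110010

# In UTF-8, each of these Chinese characters takes three 8-bit chunks.
utf8_bytes = chinese.encode("utf-8")
print(len(ascii_bytes))  # 4 bytes for 4 English letters
print(len(utf8_bytes))   # 6 bytes for 2 Chinese characters
```

So the two Chinese characters take more bytes than the four English letters, which is the trade-off I was clumsily trying to describe in the first place.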