UTF8

Monday, 30th May 2005
UTF8 is amazing. It is one way of encoding not only all Roman characters, but also Russian, Greek and Arabic ones. These are different scripts, but their alphabets are of similar size, a few dozen characters each.

I'm truly surprised that one and the same page can host all these characters. How is this possible? A single byte has only 256 possible values (and ASCII itself uses just 128 of them), so there is a limit, isn't there? It is even odder when the same page also contains Chinese and Japanese characters, languages with thousands of characters.

What is behind the UTF8 encoding?

Open the HTML source of columbia.edu/kermit/utf8.html in an ASCII editor and you see just strange characters. It looks like hexadecimal data, but how is it encoded? The browser conjures the right characters onto the screen like a magician, and I just can't figure out how. Wow, UTF8 is magic!
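To make the "strange characters" a bit less mysterious, here is a minimal Python sketch. The sample word is my own assumption (it is not taken from the Kermit page), and Latin-1 stands in for whatever one-byte-per-character view a plain editor uses. It shows the same bytes three ways: as hex, as an editor would misread them, and as a browser decodes them.

```python
# Sample text chosen for illustration only; any non-ASCII text would do.
text = "Привет"

# The bytes that actually sit in a UTF8 file.
raw = text.encode("utf-8")

# 1. The raw bytes as hex, as a hex editor would show them.
hex_view = raw.hex(" ")

# 2. The same bytes misread as one character per byte (Latin-1 assumed here).
editor_view = raw.decode("latin-1")

# 3. The same bytes decoded as UTF8, which is what the browser does.
browser_view = raw.decode("utf-8")

print(hex_view)
print(editor_view)
print(browser_view)
```

Nothing is lost in the editor view; the bytes are simply being read with the wrong rule. Each Cyrillic letter here takes two bytes, which is why the editor shows twice as many characters as the browser.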

The more I study this, the more intriguing it gets. It seems impossible for the second byte of a Japanese character to look like a carriage return (x'0D') or a quote (x'22'). A test PHP script generates values with x'2D' as the last byte. Oddly, UTF8 seems to skip these dangerous values: I fail to get a dangerous character on screen, and copying the generated browser text into a hex editor does not show any dangerous values either. Huh? Where did the x'2D' value go for the last byte?
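The skipped values are no accident: in UTF8 every byte of a multi-byte character is x'80' or higher, so values such as x'0D', x'22' or x'2D' can only ever appear as the plain ASCII characters themselves. The sketch below checks this by brute force over every code point outside ASCII. The function name is my own invention, and it is written in Python rather than as the original PHP test script.

```python
def multibyte_bytes_are_safe() -> bool:
    """Check that no multi-byte UTF8 sequence contains an ASCII-range byte."""
    for code_point in range(0x80, 0x110000):
        # Surrogates (U+D800..U+DFFF) are not characters and cannot be encoded.
        if 0xD800 <= code_point <= 0xDFFF:
            continue
        encoded = chr(code_point).encode("utf-8")
        lead, continuation = encoded[0], encoded[1:]
        # The lead byte of a multi-byte sequence is always x'C2' or higher.
        if lead < 0xC2:
            return False
        # Every following byte lies in the range x'80'..x'BF'.
        if any(not 0x80 <= byte <= 0xBF for byte in continuation):
            return False
    return True


print(multibyte_bytes_are_safe())
```

The loop visits a bit over a million code points, so it may take a second or so to run.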

Now this mystery must be solved. How does this UTF8 encoding work?

UTF8 encoding explained

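Wikipedia explains the UTF8 encoding. The encoding seems a bit wasteful with bits, since every byte spends some of them on marking its role, but it is very efficient with bytes: plain ASCII text still takes just one byte per character.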
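For the curious, the bit layout can be sketched by hand. A lead byte starts with 0, 110, 1110 or 11110 depending on whether the character takes one, two, three or four bytes, and every following byte starts with 10 and carries six bits of the code point. The function name below is my own, and Python's built-in encoder does the real work in practice; this is only a sketch to show the idea.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF8 by hand (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:
        # 0xxxxxxx: plain ASCII, one byte.
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-made result with Python's own encoder.
for ch in ["A", "é", "日", "😀"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "))
```

The Japanese character 日 (U+65E5) comes out as the three bytes e6 97 a5, none of which can be mistaken for a quote or a carriage return.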

Conclusion

Till next nut,
Nut

Special thanks to the gentlemen of 4uIT for their UTF8 research.