UTF8, extremely clever coding of characters

UTF8	Monday, 30th May 2005

UTF8 is amazing. It is one way of coding not only all Roman characters, but also Russian, Greek and Arabic. Those are different characters, but still similarly sized alphabets of a few dozen characters each.

I'm truly surprised that one and the same page can host all these characters. How is this possible? There is an end to the 256 possible values of a single byte, isn't there? And ASCII itself uses only 128 of them. It is even odder when the same page also contains Chinese and Japanese characters, languages with thousands of characters.

How does UTF8 work?
How is it possible to have 1 byte ASCII characters sharing a page with multi byte characters?
How on earth can this work?

What is behind the UTF8 encoding?

Have a look at the HTML source of columbia.edu/kermit/utf8.html with an ASCII editor and you will see nothing but strange characters. It looks like hexadecimal data, but how is it encoded?

Does every character have its own size byte? No, ASCII characters just seem to stand on their own.
What flags the start and the end of a multi byte character? Is there a start and stop sign around similar size characters? No, no signs to discover.
How can this encoding be so terribly efficient? How does UTF8 distinguish characters of different lengths?

The screen shows the characters like a magician and I just can't figure out how. Wow, UTF8 is magic!

The more I study this, the more intriguing it gets. It seems impossible for the second byte of a Japanese character to look like a carriage return (x'0D') or a quote (x'22'). A test PHP script generates values with x'2D' as last byte. It is odd, but UTF8 seems to skip these dangerous values. I fail to get a dangerous character on screen. A copy and paste of the generated browser text into a hex editor does not show any dangerous values. Huh? Where did the x'2D' value go for the last byte?

Now this mystery must be solved. How does this UTF8 encoding work?

UTF8 encoding explained

Wikipedia explains the UTF8 encoding.

ASCII characters of one byte always start with a 0 bit.
The first byte of a size two character always starts with the bits 110.
The first byte of a size three character always starts with the bits 1110.
The first byte of a size four character always starts with the bits 11110.
A follow up byte of a multi byte character always starts with the bits 10.
So every byte of a multi byte character starts with a 1 bit, while a carriage return and a quote are ASCII values that start with a 0 bit. This makes it impossible for a follow up byte to look like a carriage return or a quote.
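
The bit patterns above can be checked with a few lines of code. What follows is only a minimal sketch, assuming Python 3 and its built-in UTF8 encoder. The function names byte_role and describe are made up for this illustration, and the three sample characters (A, an accented e and a Japanese character) are an arbitrary choice.

```python
def byte_role(b: int) -> str:
    """Classify a single byte of UTF8 data by its leading bits."""
    if b & 0b10000000 == 0b00000000:
        return "ASCII"        # 0xxxxxxx
    if b & 0b11000000 == 0b10000000:
        return "follow up"    # 10xxxxxx
    if b & 0b11100000 == 0b11000000:
        return "start of 2"   # 110xxxxx
    if b & 0b11110000 == 0b11100000:
        return "start of 3"   # 1110xxxx
    if b & 0b11111000 == 0b11110000:
        return "start of 4"   # 11110xxx
    return "invalid"          # 11111xxx never appears in UTF8


def describe(text: str) -> list:
    """Return (bits, role) for every byte of the UTF8 form of text."""
    return [(format(b, "08b"), byte_role(b)) for b in text.encode("utf-8")]


# Sample: 'A' (1 byte), U+00E9 (2 bytes), U+65E5 (3 bytes).
for bits, role in describe("A\u00e9\u65e5"):
    print(bits, role)
```

A reader only needs the leading bits of a byte to know whether it is a complete ASCII character, the start of a longer character, or a follow up byte. That is why no size byte and no start or stop sign is needed.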

This encoding seems a bit of a waste of bits but it is very efficient with bytes.

Every ASCII character fits in one byte. The large group of English texts needs just one byte per character.
The most common multi byte characters of other languages fit into two or three bytes.
It is just the very exceptional characters that need 4 bytes, which is the maximum in today's UTF8. But well, that is a small minority.
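
The byte counts are easy to measure. The following is a rough sketch, assuming Python 3. The sample characters and their labels are picked only for illustration, and utf8_length is a made-up helper name.

```python
# One sample character per script; the selection is arbitrary.
samples = {
    "A": "Roman letter (ASCII)",
    "\u00e9": "Roman letter with accent",
    "\u0416": "Russian (Cyrillic)",
    "\u03a9": "Greek",
    "\u0639": "Arabic",
    "\u65e5": "Chinese / Japanese",
    "\U0001d11e": "musical symbol, one of the exceptional characters",
}


def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies in UTF8."""
    return len(ch.encode("utf-8"))


for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s)")
```

With these samples the Roman letter takes 1 byte, the accented, Russian, Greek and Arabic letters take 2 bytes each, the Chinese / Japanese character takes 3 bytes and the musical symbol takes 4.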

Conclusion

UTF8 truly is amazingly efficient.
It looks a lot like ASCII for languages with Roman characters and uses only 2 or 3 bytes per character for most other languages.
It does not cause problems with special characters. A byte that looks like a carriage return can only be a carriage return.
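
The claim about special characters can be tested by brute force. This is only a sketch, assuming Python 3. It is not the PHP test script mentioned earlier, and the function names are invented for this example. It encodes every character beyond ASCII and counts how many contain a byte in the ASCII range.

```python
def contains_ascii_byte(ch: str) -> bool:
    """True if any byte of the UTF8 form of ch is below x'80'."""
    return any(b < 0x80 for b in ch.encode("utf-8"))


def count_offenders() -> int:
    """Count non-ASCII characters whose UTF8 bytes include an ASCII value."""
    offenders = 0
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF8
        if contains_ascii_byte(chr(cp)):
            offenders += 1
    return offenders


print(count_offenders())  # prints 0

# Searching the raw bytes for a carriage return finds only the real one.
data = "\u65e5\u672c\u8a9e\r\"quoted\"".encode("utf-8")
print(data.count(b"\r"), data.count(b'"'))  # prints 1 2
```

Because no multi byte character ever contains a byte below x'80', old byte oriented code that looks for a carriage return or a quote keeps working on UTF8 data.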

Till next nut,
Nut

Special thanks to the gentlemen of 4uIT for their UTF8 research.