The wonder of UTF-8 | Daily Manchester
The wonder of UTF-8


When computers were first developed, they needed an indexing system that says which binary code corresponds to which character. Different countries created their own systems. Some needed only 8 bits per character, while countries like Japan needed 16 bits to index their logographic system of writing (which contains tens of thousands of kanji in total).

This usually worked fine when sending files within one country, although not always, because different companies sometimes used different indexing systems. But as the internet came into wider use and files were sent between countries to different machines, documents often arrived as gibberish.
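The gibberish effect is easy to reproduce. Here is a minimal sketch, assuming the sender used Shift JIS (one of the Japanese encodings) and the receiver guessed Latin-1; the choice of those two encodings is mine, purely for illustration.

```python
# The same bytes, read under two different indexing systems.
text = "日本語"

# Bytes as a Japanese machine of the era might have stored them.
raw = text.encode("shift_jis")

# A machine elsewhere assumes a Western 8-bit encoding and gets nonsense.
garbled = raw.decode("latin-1")

# Reading the bytes with the encoding they were written in restores the text.
restored = raw.decode("shift_jis")

print(len(raw))          # 6 (two bytes per character)
print(garbled == text)   # False
print(restored == text)  # True
```

Nothing is lost in transit here. The bytes are identical on both machines, and only the table used to interpret them differs.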

For this reason the Unicode Consortium was created, to bring every character that exists into one indexed system, even emoji, which you can read about in my article here. But there were two issues that needed addressing, one of which was dealt with rather smartly. The first issue was that it needed to be backward compatible with ASCII (used before UTF-8, mainly in the USA and England). This was simple: the alphabet keeps its ASCII positions, with “A” at 65 (01000001 in binary) and “a” at 97 (01100001).
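You can check the ASCII compatibility yourself. This short Python sketch prints the index and binary form of “A” and “a”, then confirms that plain English text produces the same bytes in ASCII and in UTF-8 (the sample word is arbitrary).

```python
# Index and 8-bit binary form of the first upper- and lower-case letters.
for ch in "Aa":
    code_point = ord(ch)
    print(ch, code_point, format(code_point, "08b"))
# A 65 01000001
# a 97 01100001

# Plain ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
same = ascii_bytes == utf8_bytes
print(same)  # True
```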

The second issue was harder. Unicode indexes many thousands of characters, which you cannot fit in only 8 bits (which allow only 256 possible values). You could just extend every character to 2 bytes (16 bits, giving 65,536 possible values), as the Japanese systems had done for ideographs, or more (24 bits). But if you were only using English characters, at least half of the space would be wasted on zeros (since you only need the 8 bits that ASCII uses).
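To put numbers on that trade-off, here is a rough sketch that counts how many values fit in 8, 16 and 24 bits, then compares the size of the same English sentence in Python's built-in fixed-width encodings (UTF-16 and UTF-32, used here as stand-ins for "2 bytes or more per character") against UTF-8. The sample sentence is arbitrary.

```python
# How many distinct characters each fixed width could index.
capacity = {bits: 2 ** bits for bits in (8, 16, 24)}
print(capacity)  # {8: 256, 16: 65536, 24: 16777216}

sample = "Hello, world"  # 12 characters, all plain ASCII

# Size in bytes under different encodings ("-le" avoids a byte-order mark).
sizes = {
    "utf-8": len(sample.encode("utf-8")),
    "utf-16": len(sample.encode("utf-16-le")),
    "utf-32": len(sample.encode("utf-32-le")),
}
print(sizes)  # {'utf-8': 12, 'utf-16': 24, 'utf-32': 48}

# Every second byte of the 16-bit version is a wasted zero.
zero_bytes = sample.encode("utf-16-le").count(0)
print(zero_bytes)  # 12
```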

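This issue was solved by storing, at the front of a character’s first byte, how many bytes belong to that character. A byte that starts with “0” is a single ASCII character. Otherwise the first byte starts with as many “1”s as there are bytes in the character, followed by a “0”, and every following byte starts with “10”. That gives “110xxxxx 10xxxxxx” for two bytes or “1110xxxx 10xxxxxx 10xxxxxx” for three. The “x”s are the index bits (which together give the number of the character).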
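The byte layout can be sketched in a few lines of Python. This is an illustrative encoder, not how any real library does it: the names `utf8_encode` and `bits` are made up for this example, and in practice you would just call `str.encode("utf-8")`. It also covers the four-byte form “11110xxx 10xxxxxx 10xxxxxx 10xxxxxx”, which follows the same rule.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 0xxxxxxx: plain ASCII, one byte.
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 11 index bits.
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b00111111),
        ])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 index bits.
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b00111111),
            0b10000000 | (code_point & 0b00111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 index bits.
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b00111111),
        0b10000000 | ((code_point >> 6) & 0b00111111),
        0b10000000 | (code_point & 0b00111111),
    ])


def bits(data: bytes) -> str:
    """Show each byte as eight binary digits, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in data)


print(bits(utf8_encode(ord("A"))))  # 01000001
print(bits(utf8_encode(ord("é"))))  # 11000011 10101001
print(bits(utf8_encode(ord("€"))))  # 11100010 10000010 10101100
```

Reading the output for “é”, the leading “110” announces a two-byte character and the “10” marks the continuation byte. The remaining bits, 00011 and 101001, join up to 11101001, which is 233, the index of “é”.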

Now we have a system that can be used to type characters from a great many languages (some of them dead), while spending only as many bytes as each character needs, so plain English text takes no more room than it did in ASCII.

I recommend watching Tom Scott’s video on Computerphile, “Characters, Symbols and the Unicode Miracle”.
