Part 2 of 4: Windows 1252 vs. UTF-8 & more
Continued from part 1
If only the entire IT industry had agreed on a common encoding back in the day, things would be considerably easier to deal with now. However, this is not the case. Over the years, various companies have set out to solve the exact same problem (how to represent text as binary data for storage or transmission), each in their own way. As a result, an astounding number of encoding systems exist today. Frustratingly, many of them are almost identical, leading one to question the necessity for their existence even further.
Many modern encodings are based on ASCII but extend it to include more characters. As a result, text that contains only ASCII characters can often be decoded as ASCII, even though it was technically encoded using a different standard. This is because many larger encodings use the same codes as ASCII for the first 128 characters. Beyond those characters, the encoding schemes differ to a greater or lesser degree.
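A quick way to see this ASCII compatibility is to encode the same text with several codecs and compare the resulting bytes. The following is a minimal Python sketch. The sample strings are made up for illustration, and `cp1252` is simply the name Python's standard library uses for Windows-1252.

```python
# Text made up only of ASCII characters produces identical bytes
# under ASCII and under encodings that extend it.
text = "Hello, world!"

ascii_bytes = text.encode("ascii")
cp1252_bytes = text.encode("cp1252")  # Python's codec name for Windows-1252
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == cp1252_bytes == utf8_bytes)  # True

# The same bytes can therefore be decoded with any of the three codecs.
print(utf8_bytes.decode("ascii"))  # Hello, world!

# A non-ASCII character breaks the equivalence: the bytes now depend
# on the encoding that was chosen.
accented = "\u00e9"  # the character é
print(accented.encode("cp1252").hex())  # e9
print(accented.encode("utf-8").hex())   # c3a9
```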
Let us compare two of the most common encodings used for western languages, Windows-1252 and UTF-8.
Windows-1252 is the default encoding used by Windows systems in most western countries. This means that text data produced by software running on such systems will use the Windows-1252 encoding unless the software is explicitly set to use a different one. Some software lets the user choose which encoding to use, some is set to use a specific encoding regardless of the system default, and some leaves the choice up to the system itself.
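The difference between relying on the system default and choosing an encoding explicitly can be sketched in Python. This is only an illustration: the value reported by `locale.getpreferredencoding` depends on the operating system and its configuration, and the file name `example.txt` is a hypothetical one created in a temporary directory.

```python
import locale
import os
import tempfile

# What the system (as seen by Python) would use when no encoding is given.
# On a western Windows installation this is often "cp1252"; on most
# Linux and macOS systems it is usually "UTF-8".
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Software that names the encoding explicitly does not depend on that default.
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as f:
    f.write("Na\u00efvet\u00e9")

# Reading the file back in binary mode shows the raw UTF-8 bytes.
with open(path, "rb") as f:
    raw = f.read()
print(raw.hex(" "))
```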
Windows-1252 is a single-byte encoding, which means that each character is encoded as a single byte, the same as with ASCII. However, since Windows-1252 uses the full 8 bits of each byte for its code points (as opposed to ASCII’s 7-bit codes), it has room for 256 code points compared to ASCII’s 128, although a handful of them are left unassigned. The first half of the code points are identical to the ones defined in ASCII, while the second half encodes additional characters not present in the ASCII character set.
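The one-byte-per-character property is easy to check. The sketch below uses Python's `cp1252` codec and a made-up sample string. It also decodes a few individual byte values to show that values below 80 (hex) match ASCII while higher values map to the additional characters.

```python
# "Café © 2019" written with escapes so the source file's own encoding
# cannot affect the example.
text = "Caf\u00e9 \u00a9 2019"
encoded = text.encode("cp1252")

# One byte per character: both lengths are the same.
print(len(text), len(encoded))  # 11 11

# Decode single byte values: 41 is in the ASCII half,
# A9, E9 and 80 are in the upper half specific to Windows-1252.
decoded = {}
for byte in (0x41, 0xA9, 0xE9, 0x80):
    decoded[byte] = bytes([byte]).decode("cp1252")
    print(f"{byte:02X} -> {decoded[byte]}")
```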
UTF-8 is an encoding from the Unicode standard. UTF stands for Unicode Transformation Format, and the 8 at the end signifies that it is a variable-width encoding built from 8-bit units. What this means is that each character is encoded using at least one byte (8 bits), but some use more, up to a maximum of four. As with Windows-1252, the first 128 code points are encoded identically to ASCII, but above that the two encodings differ considerably. While Windows-1252 only has room for 256 code points altogether, UTF-8 can encode the entire Unicode character set. The way this is handled is to define some of the byte values above 127 as lead bytes that start a multi-byte sequence, and others (80 to BF) as continuation bytes that may only follow a lead byte. For instance, the copyright symbol (©) is encoded as C2 A9, and the pound sign (£) is encoded as C2 A3. Because C2 is the lead byte of a 2-byte sequence and must be followed by exactly one continuation byte, it opens up 64 additional 2-byte sequences with C2 as the first byte.
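The lead-byte mechanism can be explored with a short Python sketch. It prints the UTF-8 bytes of the two symbols mentioned above and then tries every possible second byte after C2 to see which combinations Python's strict UTF-8 decoder accepts. The variable names are illustrative only.

```python
# UTF-8 bytes of the copyright symbol and the pound sign.
for ch in "\u00a9\u00a3":
    print(ch, ch.encode("utf-8").hex(" ").upper())  # C2 A9, then C2 A3

# Try every byte value as the second byte after the lead byte C2.
# Only continuation bytes (80 to BF) form a valid 2-byte sequence.
valid = []
for second in range(256):
    try:
        valid.append(bytes([0xC2, second]).decode("utf-8"))
    except UnicodeDecodeError:
        pass  # not a valid continuation byte

print(len(valid))  # 64
print(f"U+{ord(valid[0]):04X} to U+{ord(valid[-1]):04X}")  # U+0080 to U+00BF
```

Other lead bytes work the same way, which is how the remaining 2-byte, 3-byte and 4-byte sequences cover the rest of Unicode.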
This design means that most of the common characters used in western languages only take up a single byte of space, while the multi-byte sequences are used less frequently. As a result, UTF-8 is able to encode any Unicode character while still keeping data size relatively small. This is good for both permanent storage (small file sizes) and transmission (e.g. opening a web page). Because of this, UTF-8 is now by far the dominant encoding in use on the World Wide Web and accounted for 94% of all web pages as of September 2019.
Let us look at a specific example of how these two encodings differ from one another. We will use the word “Naïveté”, which contains two non-ASCII characters (it has alternative spellings without those, but the example is a recognized legitimate spelling of the word in English).
In Windows-1252, the word is encoded as the 7 bytes 4E 61 EF 76 65 74 E9. In UTF-8, it is encoded as the 9 bytes 4E 61 C3 AF 76 65 74 C3 A9. The characters ï and é exist in both encodings, but are encoded in two different ways. In Windows-1252, all characters are encoded using a single byte, so ï is EF and é is E9. In UTF-8, however, those two characters are encoded using 2 bytes each: ï is C3 AF and é is C3 A9. As a result, the word takes up two bytes more using the UTF-8 encoding than it does using the Windows-1252 encoding.
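The comparison can be reproduced with a few lines of Python. This is a minimal sketch, assuming the accented letters are stored as single precomposed characters (U+00EF and U+00E9), which is why they are written as escapes here.

```python
# "Naïveté" with precomposed ï (U+00EF) and é (U+00E9).
word = "Na\u00efvet\u00e9"

results = {}
for codec in ("cp1252", "utf-8"):  # cp1252 is Python's name for Windows-1252
    data = word.encode(codec)
    results[codec] = data
    print(codec, len(data), "bytes:", data.hex(" ").upper())

# The UTF-8 form is two bytes longer, one extra byte per accented letter.
print(len(results["utf-8"]) - len(results["cp1252"]))  # 2
```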
So, different encodings treat some characters differently from one another. Next time we will look at how this can cause problems for us.
Continued in part 3.