So, you might think that text is simply text. Well, think again. In this series of blog posts we shall descend to the byte level, study how text is actually represented by computers, and discuss how this impacts your integration solutions.
What is encoding?
Encoding is the way a computer stores text as raw binary data. To read text data properly, you have to know which encoding was used to store it, and then use that same encoding to interpret the binary data and retrieve the original text. Now you’re probably thinking, “That doesn’t sound so bad. Surely there are just a couple of different encodings, and surely all text data contains information about which encoding was used, right?” Unfortunately, the answers to those questions are not that simple, which is why encoding can be such a nightmare for developers to deal with.
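To see why the “same encoding in, same encoding out” rule matters, here is a minimal sketch in Python (just our choice of illustration; any language with distinct byte and string types would do). It stores a string as bytes, then decodes those bytes twice: once with the right encoding and once with a wrong one:

```python
text = "Smörgås"

# Store the text as raw binary data using UTF-8.
data = text.encode("utf-8")

# Decoding with the same encoding recovers the original text.
print(data.decode("utf-8"))    # Smörgås

# Decoding with a different encoding silently produces garbage.
print(data.decode("latin-1"))  # SmÃ¶rgÃ¥s
```

Note that the second decode doesn’t fail; it just produces the wrong characters, which is exactly why encoding bugs are so easy to miss.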
What is text?
What text actually is depends on the context. When stored or in transit somewhere, text is simply a piece of binary data – the same as any other kind of data. At its most basic level, it’s a long row of zeroes and ones. When it’s being actively worked on by a computer it’s still binary data, but it’s interpreted by the system as individual characters, and in many cases converted into another binary representation while it’s being processed. This representation is based on a standard called Unicode.
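As a quick illustration (a Python sketch, purely for demonstration), here is what that “long row of zeroes and ones” actually looks like for a tiny piece of text:

```python
# Encode the text to raw bytes, then show each byte's bit pattern.
for byte in "Hi".encode("utf-8"):
    print(format(byte, "08b"))
# 01001000  (the byte for 'H')
# 01101001  (the byte for 'i')
```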
A brief introduction to Unicode
Back in 1988 digital data processing was becoming more and more prevalent, but the market was still extremely fragmented, with every supplier using their own proprietary, non-standardized solutions for most things. As a result, compatibility between different computer systems was virtually non-existent, and sending data from one system to another was often very challenging. At this time, an attempt was made to stem the flow of emerging encoding problems by introducing a standardized common character set known as Unicode. This way, all the different encodings in use could at least be mapped to a common set of characters, so there wouldn’t be any doubt as to which character a given code was supposed to represent.
From the Wikipedia article for Unicode:
“Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of May 2019 the most recent version, Unicode 12.1, contains a repertoire of 137,994 characters covering 150 modern and historic scripts, as well as multiple symbol sets and emoji.
In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor.”
The Unicode character set is not an encoding itself; it is merely a standardized set of all the characters anyone is likely to encounter in a data file. However, the Unicode standard also defines a number of actual encodings, such as UTF-8, UTF-16, and UTF-32. Unlike most other forms of text encoding, all of these support the entire Unicode character set.
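To make the “code point, not glyph” idea concrete, here is a small Python sketch; `ord` and `chr` convert between characters and their Unicode code points:

```python
# Every character maps to a unique code point (just a number).
print(ord("A"))              # 65
print(f"U+{ord('€'):04X}")   # U+20AC, the conventional code point notation

# And every code point maps back to exactly one character.
print(chr(0x1F600))          # 😀
```

How that character is actually drawn on screen is left entirely to the font and rendering software.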
While Unicode did fix some of the problems in having an overabundance of co-existing character encodings, it did not solve all of them. For one thing, the adoption of the accompanying encoding systems was slow, and is still far from universal. For another, even though having a common character set to map encodings to was certainly helpful, it did not change the unfortunate fact that many types of textual data do not contain any information about which encoding system was used to produce them.
So, how does encoding work?
Right, let’s get down into the nitty-gritty details. What’s actually stored when you save a text-based file? First, we’ll take a look at one of the oldest and simplest encodings: ASCII. Here is an excerpt of the Wikipedia article for ASCII:
“Originally based on the English alphabet, ASCII encodes 128 specified characters into seven-bit integers as shown by the ASCII chart above. Ninety-five of the encoded characters are printable: these include the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, and punctuation symbols. In addition, the original ASCII specification included 33 non-printing control codes which originated with Teletype machines; most of these are now obsolete, although a few are still commonly used, such as the carriage return, line feed and tab codes.”
As ASCII was developed in the US and based on the English alphabet, it only contains the standard English characters. This means that text containing non-English characters (such as accented letters, or special letters used in other languages) cannot be accurately encoded in ASCII without changing the special characters to standard English ones. ASCII was designed using 7-bit codes to represent the characters it encoded, but because all modern computers use bytes (8 bits) as their smallest memory unit, ASCII characters are now stored using 8 bits per character, with the extra (most significant) bit simply left at zero.
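A quick Python check (again, just an illustration) confirms that every ASCII code fits in 7 bits, so the top bit of the stored byte is always zero:

```python
# Print each character's code and 8-bit pattern; the leading bit is always 0.
for ch in "ASCII":
    code = ord(ch)
    print(ch, code, format(code, "08b"))

# All ASCII codes are below 128, i.e. 2**7.
assert all(ord(c) < 128 for c in "ASCII")
```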
The entire ASCII encoding standard looks like this:
Now, let’s look at an example to see how a piece of text would be encoded in the ASCII standard. Instead of writing out the binary representations of longer texts in full, we will use hexadecimal notation for the binary data.
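For instance, encoding the text “Hello, world!” in ASCII and showing the result in hexadecimal can be sketched in Python like this:

```python
# Encode the text as ASCII bytes, then show those bytes as hex digits.
data = "Hello, world!".encode("ascii")
print(data.hex())  # 48656c6c6f2c20776f726c6421
```

Each pair of hex digits is one byte: 48 is ‘H’, 65 is ‘e’, 6c is ‘l’, and so on.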
When you open an ASCII encoded text file in a text editor, the program reads each byte of the file and looks up the value in an ASCII table to determine which character to show for that byte.
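That lookup is the reverse of the encoding step above: take each raw byte and find its character in the ASCII table. In Python (used here purely as illustration), the whole table lookup is a single decode call:

```python
# Raw bytes as they might sit in a file, written here in hex.
raw = bytes.fromhex("48656c6c6f")

# Decoding looks each byte up in the ASCII table.
print(raw.decode("ascii"))  # Hello
```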
ASCII is a very limited encoding, though. It only contains 95 printable characters and can therefore only encode those characters. If your textual data contains characters outside that set, you will have to use another encoding.
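In Python, for example, trying to squeeze a non-ASCII character into ASCII fails outright unless you explicitly ask for lossy replacement:

```python
# Encoding a non-ASCII character as ASCII raises an error.
try:
    "naïve".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err)

# Lossy fallback: the special character is replaced with '?'.
print("naïve".encode("ascii", errors="replace"))  # b'na?ve'
```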
Those are the basics of how encoding works. In the next part of the series, we’ll look at some different encodings and how they differ from one another, which you can find here: Encoding 101 - Part 2.