Continued from Encoding 101 part 2
In this third part, we’ll dive into some of the kinds of errors that can arise when data is interpreted using a different encoding than the one that was used to create it.
Now we’ve come to the central issue with having multiple encodings and why it matters so much to integration solutions. We’re going to look at exactly what happens to the data when there is a mismatch in the encodings used to read and write it.
Scenario: We have an integration which receives text data as Windows-1252, converts it to UTF-8 and sends it on to a target system. We’ll use the same word as before, “Naïveté”, to illustrate.
First, let’s examine how it’s supposed to work and go over the process step by step.
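As a rough sketch of the intended flow, the three steps can be reproduced in Python. The variable names are purely illustrative, and `cp1252` is the name Python uses for its Windows-1252 codec; a real integration platform will have its own way of configuring this.

```python
# Source system: writes the text as Windows-1252 bytes.
source_bytes = "Naïveté".encode("cp1252")

# Integration: decodes with the expected encoding, then re-encodes as UTF-8.
text = source_bytes.decode("cp1252")
target_bytes = text.encode("utf-8")

# Target system: reads the bytes as UTF-8.
received = target_bytes.decode("utf-8")

print(source_bytes.hex(" "))   # 4e 61 ef 76 65 74 e9
print(target_bytes.hex(" "))   # 4e 61 c3 af 76 65 74 c3 a9
print(received == "Naïveté")   # True
```

Note that ï and é take one byte each in Windows-1252 but two bytes each in UTF-8, so the correct output is two bytes longer than the input.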
But what happens when the integration receives data in a different encoding than the one it expects, for example UTF-8 instead of Windows-1252? You can see for yourself in the overview below.
As you can see, the target system has the text “NaÃ¯vetÃ©” saved instead of “Naïveté”. The problem is that the individual byte values that make up these UTF-8 multi-byte sequences are also valid Windows-1252 character codes. So, when interpreted as Windows-1252, each 2-byte UTF-8 character becomes two Unicode characters, one for each byte read as a single-byte Windows-1252 code. When the Unicode text string is then converted back into UTF-8, each of those characters gets encoded as its own UTF-8 byte sequence. Since all 4 of those characters (well, 3 characters, but one of them is used twice) are 2-byte characters in UTF-8, the binary representation of the string has grown from 9 bytes to 13, and it is wrong.
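The mismatch can be reproduced in a few lines of Python. This is only a sketch of the scenario under the assumption that the integration decodes strictly as Windows-1252 (`cp1252` in Python); the variable names are made up for the example.

```python
# Source system: actually sends UTF-8, not the expected Windows-1252.
utf8_bytes = "Naïveté".encode("utf-8")

# Integration: wrongly decodes the bytes as Windows-1252 ...
misread = utf8_bytes.decode("cp1252")

# ... and then encodes the garbled text as UTF-8 for the target system.
stored = misread.encode("utf-8")

print(misread == "NaÃ¯vetÃ©")         # True
print(len(utf8_bytes), len(stored))   # 9 13
print(stored.hex(" "))                # 4e 61 c3 83 c2 af 76 65 74 c3 83 c2 a9
```

No error is raised at any point, which is what makes this kind of corruption easy to miss.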
The opposite scenario causes a different set of problems.
Scenario: We have an integration which receives text data as UTF-8, converts it to Windows-1252 and sends it on to a target system. We’ll again use the word “Naïveté” to illustrate.
Here’s how it’s supposed to work:
But what if the text the integration receives isn’t UTF-8 as it expects, but Windows-1252 instead? The answer is a bit more complicated than before, because several things can happen, depending on precisely how the integration system handles this situation. The problem here is that the single bytes Windows-1252 uses to represent the ï and é characters (0xEF and 0xE9) do not form valid UTF-8 on their own: in UTF-8, each of them announces a multi-byte sequence, and the bytes that follow in this text don’t complete one. This means that they can’t be mapped to Unicode characters using the UTF-8 encoding. When trying to do so, one of five things might happen:
- Reject: The system halts the processing of the data and throws an error.
- Remove: The unrecognized characters are removed from the string.
- Replace: The unrecognized characters are replaced by the Unicode replacement character (�), which when rendered as text on a screen is usually depicted as either a black diamond with a question mark in it or the empty outline of a square, depending on the font used.
- Remember: The same as Remove or Replace, but the unrecognized character codes are kept and saved alongside the recognized ones, so they are not lost.
- Re-interpret: When faced with a code that's not valid according to the encoding used, the system may attempt to interpret the code using another likely encoding.
Which of these five things happens depends entirely on how the specific integration is programmed. Now let’s take a look at the outcomes of these five scenarios:
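As a rough illustration, Python’s decoder can be made to behave in each of the five ways. Treat this as a sketch rather than a description of any particular integration product: `surrogateescape` is only one possible approximation of “Remember”, and the `cp1252_fallback` handler is a hypothetical helper written for this example.

```python
import codecs

# Windows-1252 bytes that the integration wrongly treats as UTF-8.
raw = "Naïveté".encode("cp1252")  # 4e 61 ef 76 65 74 e9

# Reject: the default "strict" handler raises an error.
try:
    raw.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Remove: invalid bytes are silently dropped.
removed = raw.decode("utf-8", errors="ignore")

# Replace: each invalid byte becomes U+FFFD, the replacement character.
replaced = raw.decode("utf-8", errors="replace")

# Remember (approximation): "surrogateescape" stores each invalid byte as a
# lone surrogate code point, so the original bytes can be restored later.
remembered = raw.decode("utf-8", errors="surrogateescape")
restored = remembered.encode("utf-8", errors="surrogateescape")

# Re-interpret: a hypothetical custom handler that decodes only the
# offending bytes as Windows-1252 and then resumes UTF-8 decoding.
def cp1252_fallback(exc):
    if isinstance(exc, UnicodeDecodeError):
        bad_bytes = exc.object[exc.start:exc.end]
        return bad_bytes.decode("cp1252", errors="replace"), exc.end
    raise exc

codecs.register_error("cp1252_fallback", cp1252_fallback)
reinterpreted = raw.decode("utf-8", errors="cp1252_fallback")

print(rejected)                     # True
print(removed)                      # Navet
print(ascii(replaced))              # 'Na\ufffdvet\ufffd'
print(ascii(remembered))            # 'Na\udcefvet\udce9'
print(restored == raw)              # True
print(reinterpreted == "Naïveté")   # True
```

Only “Reject” makes the problem visible immediately; “Remove” and “Replace” lose information for good, while “Re-interpret” happens to recover the right text here only because the guess of Windows-1252 was correct.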
As a matter of fact, there are more than five possible results. In the “Replace” scenario, you’ll note that the description of what happens in the integration says “see below”. That’s because the next step is essentially the same problem as the bigger scenario outlined above, only reversed: the integration must try to encode a string containing Unicode characters that have no representation in the encoding used. As above, there are different ways it can go about this, depending on exactly how it was programmed. Let’s look at the different variations of that scenario, from the integration onwards.
There are no “Re-interpret” or “Remember” scenarios in this case, since those only make sense when decoding text, not when encoding it.
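The encoding side can be sketched in Python as well. The example assumes the integration already holds the text with replacement characters in it and now has to write it as Windows-1252, which has no code for U+FFFD. The three outcomes shown correspond to Reject, Remove and Replace; the variable names are illustrative.

```python
# Text held by the integration after a "Replace" decode.
text = "Na\ufffdvet\ufffd"

# Reject: the default "strict" handler raises an error.
try:
    text.encode("cp1252")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# Remove: characters without a Windows-1252 code are dropped.
removed = text.encode("cp1252", errors="ignore")

# Replace: Python substitutes a question mark for each such character.
replaced = text.encode("cp1252", errors="replace")

print(rejected)   # True
print(removed)    # b'Navet'
print(replaced)   # b'Na?vet?'
```

The substitute used for “Replace” is a choice made by the encoder; Python uses `?`, and other systems may use something else.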
Finally, to illustrate just how badly text data can get mangled if there’s a disagreement on which encodings to use, let’s look at what happens if the source and target systems both use Windows-1252, but the integration uses UTF-8 for both incoming and outgoing data. We’ll assume that the integration uses the “Replace” option for unknown character codes.
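This chain of events can be traced in Python. It is a sketch under the assumptions stated in the scenario: both end systems use Windows-1252 (`cp1252`), and the integration decodes and encodes as UTF-8 with the replace option.

```python
# Source system: writes Windows-1252.
source_bytes = "Naïveté".encode("cp1252")

# Integration: decodes as UTF-8, replacing the bytes it cannot interpret ...
text = source_bytes.decode("utf-8", errors="replace")

# ... and encodes the result as UTF-8 again.
sent_bytes = text.encode("utf-8")

# Target system: reads the bytes as Windows-1252.
stored = sent_bytes.decode("cp1252")

print(ascii(text))                  # 'Na\ufffdvet\ufffd'
print(sent_bytes.hex(" "))          # 4e 61 ef bf bd 76 65 74 ef bf bd
print(stored == "Naï¿½vetï¿½")      # True
```

Each replacement character is three bytes in UTF-8 (EF BF BD), and the target system reads every one of them as a separate Windows-1252 character. The ï that shows up in the result is a coincidence: EF happens to be both the Windows-1252 code for ï and the first byte of the encoded replacement character.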
This will almost certainly cause problems later on, when that data is supposed to be used for something functional. For example, a mangled customer name or delivery address could prevent a paying customer from having the products they ordered delivered.
Next time we’ll take a look at what we can do to prevent encoding problems from arising in the first place.