Encoding 101

Part 3 of 4: Error Scenarios & Handling

Asger Smidt

Integration Consultant

20 Jan 2020

Continued from part 2.

Now we come to the central issue with having multiple encodings, and why it matters so much to integration solutions. We are going to look at exactly what happens to the data when there is a mismatch in the encodings used to read and write it.

Scenario: We have an integration that receives text data as Windows-1252, converts it to UTF-8 and sends it on to a target system. We will use the same word as before, “Naïveté”, to illustrate.

First, let us examine how it is supposed to work. We will look at each step of the process, and what the data looks like every step of the way.
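The intended flow can be sketched in a few lines of Python. This is only an illustration, assuming Python's built-in `cp1252` codec as a stand-in for Windows-1252; the variable names are hypothetical and do not come from any particular integration platform.

```python
# Source system: encodes the text as Windows-1252 (one byte per character).
text = "Naïveté"
received = text.encode("cp1252")

# Integration: decodes with the agreed encoding, then re-encodes as UTF-8.
decoded = received.decode("cp1252")
sent = decoded.encode("utf-8")

# Target system: decodes as UTF-8 and ends up with the original text.
stored = sent.decode("utf-8")

print(received.hex(" "))  # 4e 61 ef 76 65 74 e9
print(sent.hex(" "))      # 4e 61 c3 af 76 65 74 c3 a9
print(stored == text)     # True
```

The ï and é each take one byte in Windows-1252 (0xEF, 0xE9) and two bytes in UTF-8 (0xC3 0xAF, 0xC3 0xA9), so the outgoing message is two bytes longer but represents exactly the same text.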

But what happens then, when the integration receives data in a different encoding than the one it expects? Well, bad things happen:

So now the target system has the text “NaÃ¯vetÃ©” saved instead of “Naïveté”. The problem is that the byte values UTF-8 uses for these characters (0xC3, 0xAF and 0xA9) are all valid Windows-1252 character codes, as are nearly all byte values that can occur in UTF-8 multi-byte sequences. When interpreted as Windows-1252, each UTF-8 2-byte character therefore becomes two Unicode characters, one for each byte. When the Unicode text string is then converted back into a UTF-8 representation, each of those characters gets encoded as its own UTF-8 byte sequence. Since all 4 of those characters (well, 3 characters, but one of them is used twice) are 2-byte characters in UTF-8, the binary representation of the string is now significantly longer (13 bytes instead of 9), as well as being wrong.
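The mismatch can be reproduced with the same kind of sketch, again assuming Python's `cp1252` codec stands in for Windows-1252 and using hypothetical variable names. The only change from the working flow is that the source actually sends UTF-8 while the integration still decodes as Windows-1252.

```python
# Source system actually sends UTF-8 ...
received = "Naïveté".encode("utf-8")    # 9 bytes

# ... but the integration still assumes Windows-1252 when decoding.
decoded = received.decode("cp1252")     # every byte becomes one character
sent = decoded.encode("utf-8")          # each of those is re-encoded

# Target system decodes the UTF-8 it was sent, faithfully storing the damage.
stored = sent.decode("utf-8")

print(stored)                     # NaÃ¯vetÃ©
print(len(received), len(sent))   # 9 13
```

No step raises an error here, which is what makes this kind of mismatch easy to miss: every byte is a legal Windows-1252 code, so the integration has no way of knowing that anything went wrong.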

The opposite scenario carries different problems with it:

Scenario: We have an integration that receives text data as UTF-8, converts it to Windows-1252 and sends it on to a target system. We will use the same word as before, “Naïveté”, to illustrate.

Again, we start with how it is supposed to work:

But what if the text the integration receives is not UTF-8 as it expects, but Windows-1252 instead? This way around, the answer is a bit more complicated than before. In this scenario several different things can happen, depending on precisely how the integration system handles this specific situation. The problem here is that the byte values Windows-1252 uses for the ï and é characters (0xEF and 0xE9) do not form valid UTF-8 sequences in this text: UTF-8 treats each of them as the first byte of a multi-byte character, and what follows is not a valid continuation. This means that they cannot be mapped directly to Unicode characters using the UTF-8 encoding. When trying to do so, one of five different things may happen:

  • Reject: The system halts the processing of the data and throws an error.
  • Remove: The unrecognized characters are removed from the string.
  • Replace: The unrecognized characters are replaced by the Unicode replacement character (�), which, when rendered as text on a screen, is usually depicted as either a black diamond with a question mark in it or the empty outline of a square, depending on the font used.
  • Remember: The same as Remove or Replace, but the unrecognized character codes are remembered and are still saved along with the recognized ones.
  • Re-interpret: When faced with a code that is not valid according to the encoding used, the system may attempt to interpret the code using another likely encoding.
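As a rough illustration, the five options can be approximated with Python's codec error handlers. This is a sketch under assumptions, not a description of how any specific integration product behaves: `surrogateescape` is Python's closest analogue to “Remember”, and `decode_with_fallback` is a hypothetical helper written for this example.

```python
received = "Naïveté".encode("cp1252")   # b'Na\xefvet\xe9', not valid UTF-8

# Reject: strict decoding (Python's default) raises an error.
try:
    received.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Remove: undecodable bytes are silently dropped.
removed = received.decode("utf-8", errors="ignore")

# Replace: each undecodable byte becomes U+FFFD.
replaced = received.decode("utf-8", errors="replace")

# Remember: undecodable bytes are kept as lone surrogates and can be
# written back out unchanged with the same error handler.
remembered = received.decode("utf-8", errors="surrogateescape")
restored = remembered.encode("utf-8", errors="surrogateescape")

# Re-interpret: fall back to another likely encoding (hypothetical helper).
def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)

reinterpreted = decode_with_fallback(received)

print(rejected)                      # True
print(removed)                       # Navet
print(ascii(replaced))               # 'Na\ufffdvet\ufffd'
print(ascii(remembered))             # 'Na\udcefvet\udce9'
print(restored == received)          # True
print(reinterpreted == "Naïveté")    # True
```

Only “Reject” makes the problem visible at the moment it happens. “Re-interpret” recovers the right text here, but only because the fallback guess happens to match the encoding the source really used.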

Which of these five options gets enacted depends entirely on how the specific integration is programmed. Now let us see what the five possible results are:

As a matter of fact, there are more than five possible results. In the “Replace” scenario, you will note that the entry for what happens in the integration says “see below”. That is because it is essentially the same thing as the bigger scenario outlined above, only reversed. The integration has to try to encode a string containing Unicode characters that have no representation in the encoding used. As above, there are different ways it can go about this, depending on exactly how it was programmed. Let us look at the different variations of that scenario, just from the integration and onwards:
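The encoding side can be sketched the same way. The example below assumes the integration has already decoded with “Replace” and now has to write the replacement character out as Windows-1252, which has no code for it; Python's encode error handlers are used as stand-ins for the integration's behaviour.

```python
text = "Na\ufffdvet\ufffd"   # the string left behind by the "Replace" decoding step

# Reject: strict encoding raises an error, since U+FFFD has no cp1252 code.
try:
    text.encode("cp1252")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# Remove: unencodable characters are dropped.
removed = text.encode("cp1252", errors="ignore")

# Replace: unencodable characters become a question mark.
replaced = text.encode("cp1252", errors="replace")

print(rejected)   # True
print(removed)    # b'Navet'
print(replaced)   # b'Na?vet?'
```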

There is no “Re-interpret” scenario in this case, since that only makes sense when decoding text, not when encoding it. The “Remember” scenario is valid, but is the same as in the table above this one.

Finally, to illustrate just how badly text data can get mangled if there is disagreement on which encodings to use, let us look at what happens if the source and target systems both use Windows-1252, but the integration uses UTF-8 both ingoing and outgoing. We will assume that the integration uses the “Replace” option for unknown character codes:
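This chain can be reproduced step by step. The sketch assumes Python's `cp1252` codec for Windows-1252 and the `replace` error handler for the “Replace” option; the variable names are hypothetical.

```python
# Source system writes Windows-1252.
received = "Naïveté".encode("cp1252")

# Integration reads it as UTF-8 with "Replace", then writes UTF-8.
decoded = received.decode("utf-8", errors="replace")   # ï and é become U+FFFD
sent = decoded.encode("utf-8")                         # U+FFFD is 3 bytes: ef bf bd

# Target system reads those bytes as Windows-1252.
stored = sent.decode("cp1252")

print(sent.hex(" "))   # 4e 61 ef bf bd 76 65 74 ef bf bd
print(stored)          # Naï¿½vetï¿½
```

The original ï and é are gone after the first decoding step, so nothing downstream can recover them. The three-character sequence ï¿½ is simply the UTF-8 encoding of the replacement character read back as Windows-1252.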

Yikes. That might certainly cause problems later on, when that data is supposed to be used for something functional. It might have been the name of a customer or a delivery address, and this may prevent a paying customer from having products delivered.

Next time we will look at what we can do to prevent encoding problems from arising in the first place.

Continued in part 4.