Continued from Encoding 101 part 3
In this fourth and final part of the series, we’ll discuss how to prevent encoding problems.
So, what can we do?
Can we reliably detect the encoding used to produce a given lump of encoded text data? No, unfortunately not. We can make an educated guess (or write code to the same effect), but it won’t be reliable, because the bytes alone simply don’t carry enough information to give a certain result. We could write a method to do it and refine its logic for the specific data involved every time it guessed wrong, but it would only be valid for the specific integration scenario we wrote it for. It might eventually be more than 99% accurate, but it’s not economically feasible to maintain and update such complicated code for each and every integration we do.
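To illustrate why guessing is unreliable, here is a minimal Python sketch. The candidate list and the helper name guess_encodings are assumptions made for this example, not part of any library. It simply tries each candidate and keeps the ones that decode without error.

```python
# Hypothetical list of encodings we suspect the sender might have used.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-1", "cp437"]


def guess_encodings(data, candidates=CANDIDATES):
    """Return (encoding, decoded text) for every candidate that decodes cleanly."""
    matches = []
    for name in candidates:
        try:
            matches.append((name, data.decode(name)))
        except UnicodeDecodeError:
            # This candidate is ruled out; the others remain possible.
            continue
    return matches


# The sender wrote "Søren" as ISO-8859-1, but we only see the raw bytes.
sample = "Søren".encode("iso-8859-1")
results = guess_encodings(sample)

for name, text in results:
    print(name, "->", text)
```

Only UTF-8 is ruled out here. Three candidates decode without complaint, and they don’t even agree on the text, so the code has no way of knowing which one the sender meant.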
What we can do is to follow a set of best practices to minimize the possibility of encoding errors, as outlined below.
Use proper data formats
There are many text-based data formats that either contain information about their encoding or at least provide the option to include it. For instance, all EDIFACT formats explicitly state which encoding each file uses. Likewise, XML files can contain an XML declaration naming the encoding they use (although it’s not mandatory), so all XML-based formats can do the same. Using formats like these, and using the options they provide for eliminating encoding errors, can prevent a lot of the potential issues you might otherwise encounter. Whether you’re on the sending or receiving end of an integration, always seek to use proper data formats.
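As a small sketch of what the XML declaration buys you, the Python snippet below parses the same ISO-8859-1 bytes with and without a declaration, using the standard library’s xml.etree.ElementTree. The sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same content, encoded as ISO-8859-1, with and without an XML declaration.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Søren</name>'.encode("iso-8859-1")
undeclared = "<name>Søren</name>".encode("iso-8859-1")

# With the declaration, the parser reads the encoding name and decodes correctly.
name = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the bytes.
try:
    ET.fromstring(undeclared)
    undeclared_failed = False
except ET.ParseError:
    undeclared_failed = True

print(name, undeclared_failed)
```

The receiver never had to be told the encoding out of band. The file itself carried that information.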
Use structured interfaces
There are many ways to send data from one system to another. Some of these let the sender include information about the data being sent, such as which encoding it uses. When transferring data via HTTP (e.g. sending data to a web service), the HTTP Content-Type header can indicate the encoding of the data through its charset parameter.
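On the receiving side, honouring that header might look roughly like the following Python sketch. The function names charset_from_content_type and decode_body are hypothetical. The header parsing relies on the standard library’s email.message.Message, which understands the parameter syntax Content-Type uses.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = header_value
    # Returns the charset in lower case, or `default` if none was given.
    return message.get_content_charset(default)


def decode_body(body, header_value):
    """Decode a request or response body using the charset the sender declared."""
    charset = charset_from_content_type(header_value)
    if charset is None:
        # Refuse to guess; the sender should state the encoding explicitly.
        raise ValueError("Content-Type header does not declare a charset")
    return body.decode(charset)


body = "Søren".encode("iso-8859-1")
text = decode_body(body, "text/csv; charset=ISO-8859-1")
print(text)
```

Raising an error when no charset is present is a design choice made for this sketch. Falling back to an agreed default would be an equally reasonable option, as long as both parties know what that default is.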
Agree on which encoding to use
If you have no option other than sending or receiving plain text files of some kind (e.g. CSV files, which contain data fields separated by a separator character, often a semicolon), make sure to agree with the sender or receiver on which encoding to use. As long as both parties always stick to the encoding they’ve agreed upon for each integration, there won’t be any issues.
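In code, sticking to an agreement mostly means never relying on a platform default. The Python sketch below assumes the two parties have agreed on UTF-8 and semicolon-separated fields. The constant AGREED_ENCODING and both function names are made up for this example.

```python
import csv
import io

# Assumption for this example: both parties have agreed on UTF-8.
AGREED_ENCODING = "utf-8"


def write_csv_bytes(rows, encoding=AGREED_ENCODING):
    """Serialize rows as semicolon-separated CSV in the agreed encoding."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, delimiter=";").writerows(rows)
    return buffer.getvalue().encode(encoding)


def read_csv_bytes(data, encoding=AGREED_ENCODING):
    """Parse CSV bytes, failing loudly if they are not in the agreed encoding."""
    text = data.decode(encoding)  # strict: raises UnicodeDecodeError on a mismatch
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))


rows = [["name", "city"], ["Søren", "Århus"]]
payload = write_csv_bytes(rows)
received = read_csv_bytes(payload)
print(received)
```

Because the decoding is strict, a file that breaks the agreement (for example one written as ISO-8859-1) is rejected immediately instead of silently turning into garbled names further down the line.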
The existence and use of multiple different text encodings frequently leads to problems, particularly in integration. A problem caused by an encoding error can be tremendously difficult to trace and solve, because a given batch of data may be encoded and decoded many times on its way from the sending system to the receiving one. This can make it very difficult to find out where the error actually occurs.
It’s therefore always wise to be aware of encoding and to be explicit about it. When planning a new exchange of data between two systems, try to determine exactly which encoding each system uses to generate its outgoing transmissions, and which encoding it expects for its incoming ones. That way you probably won’t have any unpleasant bumps down the road.