Part 4 of 4: Encoding Best Practices
Can we reliably detect the encoding that was used to produce a given lump of encoded text? No, unfortunately not. We can make an educated guess (or write code to the same effect), but the result won’t be reliable, because the bytes alone simply don’t carry enough information to be 100% certain. We could write a method to do it and refine it with new logic, based on the specific data involved, every time it got it wrong, but it would only be valid for the specific integration scenario we wrote it for. It might eventually be more than 99% accurate, but it is not economically feasible to maintain and update such complicated logic for each and every integration we do.
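To make the point concrete, here is a minimal sketch of such an educated guess. The function name `guess_encoding` and the candidate list are assumptions made for this illustration, not a recommended detection method. It simply returns the first candidate that decodes the bytes without an error, and the example shows how that can be wrong without any error being raised.

```python
def guess_encoding(data: bytes, candidates=("utf-8", "latin-1")) -> str:
    """Return the first candidate encoding that decodes `data` without error.

    Hypothetical helper: a successful decode only proves the bytes are
    *valid* in that encoding, not that the sender actually used it.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("no candidate encoding could decode the data")


# The sender used Windows-1252, which is not in our candidate list.
sent = "Price: 10 €".encode("cp1252")

guessed = guess_encoding(sent)   # latin-1 accepts every byte sequence
received = sent.decode(guessed)  # decodes "successfully", but the euro sign is lost

print(guessed, received == "Price: 10 €")
```

The guess succeeds technically and still produces the wrong text, which is exactly why a guess cannot replace knowing the encoding.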
What we can do is follow a set of best practices to minimize the possibility of encoding errors. I will outline those here.
There are many text-based data formats which either contain information about their encoding, or at least provide the option to include it. EDIFACT interchanges, for instance, state which character set they use in their header (the syntax identifier in the UNB segment). Likewise, XML files can contain an XML declaration with the name of the encoding they use (although it is not mandatory), so all XML-based formats can do the same. Using formats like these, and actually using the options they provide for declaring the encoding, can eliminate a lot of the potential issues you might otherwise face. Whether you are on the sending or receiving end of an integration, always seek to use proper data formats.
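As an illustration of a self-describing format, the following sketch uses Python's standard `xml.etree.ElementTree` module. The element name `item` and the sample text are made up for the example. The parser is given raw bytes and picks the encoding from the XML declaration, so the receiving code never has to guess.

```python
import xml.etree.ElementTree as ET

text = "Smørrebrød"

# Sender side: the declaration names the encoding the bytes are written in.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    f"<item>{text}</item>"
).encode("iso-8859-1")

# Receiver side: pass bytes, not a pre-decoded string, so the parser
# can honour the declared encoding.
root = ET.fromstring(xml_bytes)
decoded = root.text

# Writing with an explicit declaration (xml_declaration requires Python 3.8+).
out = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(decoded == text, out.startswith(b"<?xml"))
```

The important design choice is to hand the parser the undecoded bytes. Decoding them yourself first throws away the very information the declaration was there to provide.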
There are many ways to send data from one system to another. Some of these offer the sender the possibility of including information about the data being sent, such as which encoding it uses. When transferring data via HTTP (e.g. sending data to a web service), the charset parameter of the HTTP Content-Type header can be used to indicate the encoding of the data.
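A receiver can read that information with nothing but the standard library. The sketch below is one possible way to do it; the helper name `charset_from_content_type` is an assumption for this example. It returns `None` when the sender did not state a charset, so the caller has to make a conscious decision rather than silently falling back to a default.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper. Returns the charset in lower case, or None if
    the header does not declare one.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


declared = charset_from_content_type("text/csv; charset=ISO-8859-1")
missing = charset_from_content_type("text/csv")

body = "Søren".encode("iso-8859-1")
if declared is not None:
    print(body.decode(declared))
```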
If you have no other option than exchanging plain text files of some kind (e.g. CSV files, which contain data fields separated by a separator character, often a semicolon), make sure to agree with the sender or receiver on which encoding should be used. As long as both parties always stick to the encoding they have agreed upon for each integration, encoding errors should not arise.
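In code, sticking to the agreement means never relying on a platform default. The following sketch assumes the two parties agreed on UTF-8 and a semicolon separator; the constant `AGREED_ENCODING` and both function names are hypothetical. Decoding is strict, so data that violates the agreement fails loudly instead of being imported as garbage.

```python
import csv
import io

AGREED_ENCODING = "utf-8"  # assumption: the value both parties agreed on


def rows_to_csv_bytes(rows):
    """Serialize rows to semicolon-separated CSV in the agreed encoding."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, delimiter=";").writerows(rows)
    return buffer.getvalue().encode(AGREED_ENCODING)


def csv_bytes_to_rows(data: bytes):
    """Parse CSV bytes, raising UnicodeDecodeError if the agreement is broken."""
    text = data.decode(AGREED_ENCODING)  # strict decoding by default
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))


rows = [["id", "name"], ["1", "Søren"]]
round_tripped = csv_bytes_to_rows(rows_to_csv_bytes(rows))
print(round_tripped == rows)
```

When working with files directly, the same idea applies: pass `encoding=AGREED_ENCODING` to `open()` on both the writing and the reading side.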
The existence and usage of multiple different text encodings frequently leads to problems, particularly in integration. A problem caused by an encoding error can be tremendously difficult to trace and solve, because a given batch of data may be encoded and decoded many times on its way from the sending to the receiving system. This can make it very difficult to find out where the error actually occurs.
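A small simulation shows why this is so hard to trace. It assumes a hypothetical chain in which two intermediate systems each receive UTF-8 bytes but decode them as Latin-1 before passing the text on. Each hop damages the text further, and the receiver only sees the end result, not which hop caused it.

```python
original = "æøå"

# Hop 1: bytes written as UTF-8, wrongly read as Latin-1.
hop1 = original.encode("utf-8").decode("latin-1")  # 'Ã¦Ã¸Ã¥'

# Hop 2: the already damaged text goes through the same mistake again.
hop2 = hop1.encode("utf-8").decode("latin-1")

print(len(original), len(hop1), len(hop2))

# Repair is only possible if we know exactly which wrong steps were taken,
# and undo each one in reverse order.
repaired = hop2.encode("latin-1").decode("utf-8")
repaired = repaired.encode("latin-1").decode("utf-8")

print(repaired == original)
```

Here the damage happens to be reversible because Latin-1 maps every byte to a character. With other encoding pairs, characters can be replaced or dropped along the way, and the original text cannot be recovered at all.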
It is therefore always wise to be aware of encoding and to be explicit about it. When planning a new exchange of data between two systems, always try to determine exactly which encoding each system uses to generate its outgoing transmissions, and which encoding it expects for its incoming ones. That way you are far less likely to have any nasty surprises down the road.