Users sometimes send messages in different languages, which makes it tricky to determine how many message credits are deducted, how many characters fit in a message, and so on.
This table gives a brief overview of the most common languages and of how browsers and mobile handsets display them correctly and legibly.
|Language||Character set for handset||Max characters / message|
|All Latin-based languages (including English)||GSM & Extended GSM||160|
Almost all Western languages (such as English, Spanish, French and German) use the Latin alphabet. The character set most commonly used to represent these characters and their accented variants, such as â, é, ï, ò and ü, is Unicode. It is also the character set most widely used on the World Wide Web to display and interpret characters.
For SMS, only the most common characters are provided for in the GSM character set. Because languages differ, certain additional characters (called special characters) also had to be accommodated, which gave rise to the Extended GSM character set. It contains everything in the GSM character set plus a small number of extra characters, such as €, [, ], {, }, |, ~, ^ and \, each of which takes up the space of two characters in a message. A message that contains only basic characters (such as A-Z and 0-9) can therefore hold 160 characters, and every Extended GSM character used reduces that number by one more. The moment a character outside both sets is detected, the entire message is encoded as Unicode and shrinks to only 70 characters, because every character then needs 16 bits instead of 7.
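The counting rules can be sketched in a few lines of Python. This is a minimal illustration that assumes standard GSM 03.38 behaviour (one slot per basic character, two per Extended GSM character, Unicode for anything else); the function name `sms_profile` is hypothetical and is not part of any Messaging Cloud API.

```python
# GSM 03.38 basic alphabet (the escape code itself is left out).
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: each of these is sent as escape + code, so it costs two slots.
GSM_EXTENDED = set("^{}\\[~]|€")


def sms_profile(text):
    """Return (encoding, units used, max units in a single message)."""
    used = 0
    for ch in text:
        if ch in GSM_BASIC:
            used += 1
        elif ch in GSM_EXTENDED:
            used += 2
        else:
            # One character outside both tables switches the whole message
            # to 16-bit Unicode; count 16-bit units instead.
            units = len(text.encode("utf-16-be")) // 2
            return ("UCS-2", units, 70)
    return ("GSM", used, 160)


print(sms_profile("Hello"))       # basic characters only
print(sms_profile("Price: 5€"))   # the euro sign costs two slots
print(sms_profile("crème brûlée"))  # û is outside both tables
```

Note that some accented letters (for example é, è, ü and ñ) are part of the basic GSM table, while others (for example â and ï) are not and trigger the Unicode fallback.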
Basic Arabic and Simplified Chinese fall into the same category as characters that are outside the GSM character sets. Messages in these languages are always sent in a 16-bit Unicode encoding (UCS-2/UTF-16), which uses two bytes per character, so the maximum size of a message stays at 70 characters.
Of course, the GSM character set makes no provision for Arabic or Chinese, so handsets have to rely on different character sets to display these characters correctly.
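The 160 and 70 limits follow from simple arithmetic. The sketch below assumes the standard 140-byte SMS payload, a figure that comes from the GSM specification rather than from this article; the helper name `max_chars` is illustrative only.

```python
# Assumption: a single SMS carries 140 bytes of user data.
PAYLOAD_BITS = 140 * 8


def max_chars(bits_per_char):
    """How many fixed-width characters fit in one message."""
    return PAYLOAD_BITS // bits_per_char


gsm_limit = max_chars(7)    # GSM characters use 7 bits each
ucs2_limit = max_chars(16)  # 16-bit Unicode uses 2 bytes each

# Sample words written as escapes: "hello" in Arabic and in Chinese.
arabic = "\u0645\u0631\u062d\u0628\u0627"
chinese = "\u4f60\u597d"

print(gsm_limit, ucs2_limit)
print(len(arabic), len(arabic.encode("utf-16-be")))    # characters, bytes
print(len(chinese), len(chinese.encode("utf-16-be")))  # characters, bytes
```

Each Arabic or Chinese character here occupies two bytes, which is why only 70 of them fit where 160 GSM characters would.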
The translation of these character sets from one to another is all done on the Messaging Cloud. All of our applications are Unicode compliant in order to accommodate the widest range of characters possible. We perform the character translation to ensure that messages arrive correctly and as expected on the various handsets available.
Polish-specific encoding considerations:
There are several systems for encoding the Polish alphabet for computers and in SMS. All letters of the Polish alphabet are included in Unicode, and thus Unicode-based encodings such as UTF-8 and UTF-16 can be used. The Polish alphabet is completely included in the Basic Multilingual Plane of Unicode.
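As a quick check of the statement about the Basic Multilingual Plane, the following Python sketch inspects the nine Polish diacritic letters in both cases. The constant and function names are illustrative assumptions, not part of any existing API.

```python
import unicodedata

# The Polish letters that are not in the English alphabet, upper and lower case.
# NFC normalisation guards against decomposed input (base letter + combining mark).
POLISH_LETTERS = unicodedata.normalize("NFC", "ĄąĆćĘęŁłŃńÓóŚśŹźŻż")


def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8 and in UTF-16."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# The Basic Multilingual Plane covers U+0000 to U+FFFF.
all_in_bmp = all(ord(ch) <= 0xFFFF for ch in POLISH_LETTERS)
print("All in the BMP:", all_in_bmp)

for ch in POLISH_LETTERS:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(ch, f"U+{ord(ch):04X}", utf8_len, utf16_len)
```

Every one of these letters takes two bytes in UTF-8 and two bytes in UTF-16, so neither encoding needs surrogate pairs for Polish text.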
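The Polish letters which are not present in the English alphabet have the following HTML codes and Unicode code points:
|Letter||HTML code||Unicode code point|
|Ą||&#260;||U+0104|
|ą||&#261;||U+0105|
|Ć||&#262;||U+0106|
|ć||&#263;||U+0107|
|Ę||&#280;||U+0118|
|ę||&#281;||U+0119|
|Ł||&#321;||U+0141|
|ł||&#322;||U+0142|
|Ń||&#323;||U+0143|
|ń||&#324;||U+0144|
|Ó||&#211;||U+00D3|
|ó||&#243;||U+00F3|
|Ś||&#346;||U+015A|
|ś||&#347;||U+015B|
|Ź||&#377;||U+0179|
|ź||&#378;||U+017A|
|Ż||&#379;||U+017B|
|ż||&#380;||U+017C|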
A common test sentence containing all the Polish diacritic letters is the nonsensical Zażółć gęślą jaźń ("Yellowize the mind with/of a gusle").
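That the test sentence really covers every Polish diacritic letter can be confirmed with a short Python sketch. The helper name `missing_diacritics` is hypothetical and chosen only for this example.

```python
import unicodedata

# The nine lower-case Polish diacritic letters, normalised to composed form.
POLISH_DIACRITICS = set(unicodedata.normalize("NFC", "ąćęłńóśźż"))


def missing_diacritics(sentence):
    """Return the Polish diacritic letters that do not occur in the sentence."""
    normalized = unicodedata.normalize("NFC", sentence).lower()
    return POLISH_DIACRITICS - set(normalized)


sentence = "Zażółć gęślą jaźń"
print(sorted(missing_diacritics(sentence)))  # an empty list means full coverage
```

A sentence like this is handy for testing an SMS route end to end: if any letter arrives garbled on the handset, the character translation for Polish is not working as expected.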