TheLib-Featurette: Strings

Probably my most important coding project, one that affects both my private programming and my work, is TheLib. The basic idea is to collect all the classes which we (two friends and myself) wrote and then used over and over again in several different projects. These classes are usually wrappers around API calls or library calls (e.g. STL, Boost, whatever), written for compatibility or convenience. That’s where the name of our lib comes from: Totally Helpful Extensions. And it is just cool to write: #include "the/exception.h".

However, I hear very often: “Why do you write a lib? There are plenty already for all tasks.”

If that were true, none of us would write programs anymore and we would only “compose” programs from libs. Well, we don’t. Or rather, I don’t. Meaning: TheLib really is helpful. It is not a replacement for the other libs. It’s a complement, an extension.

One Example: Strings!

The string functionality in TheLib is not nearly as powerful as one would require to write a fully fledged text processor. That is not the goal of TheLib. We wrote these functions to provide functionality that goes somewhat beyond the basics. The idea is to enable simple or prototypical applications to easily implement nice-to-use interfaces for the user.

Especially under Linux (but also under Windows) there are usually a total of three different types of strings:

  1. char * or std::string which store ASCII or ANSI strings with locale-dependent character sets
  2. char * or std::string which store multi-byte strings, e.g. using UTF-8 encoding
  3. wchar_t * or std::wstring which store Unicode strings.

Depending on these types, different API functions need to be called, e.g. for determining the length of the string:

  1. strlen
  2. multiple calls of mbrlen
  3. wcslen
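
As a rough sketch of these three cases: the function names below are made up for illustration and are not TheLib's API. The multi-byte variant interprets the bytes according to the current LC_CTYPE locale, so the application is assumed to have called setlocale beforehand if it wants UTF-8 to be recognised.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Returned by length_multi_byte when the input is not valid in the
// current locale's encoding (hypothetical convention for this sketch).
const std::size_t kInvalidLength = static_cast<std::size_t>(-1);

// Case 1: single-byte character set, one byte is one character.
std::size_t length_single_byte(const char *s) {
    assert(s != nullptr);
    return std::strlen(s);
}

// Case 2: multi-byte string, counted with repeated calls of mbrlen.
std::size_t length_multi_byte(const char *s) {
    assert(s != nullptr);
    std::mbstate_t state = std::mbstate_t();
    std::size_t remaining = std::strlen(s);
    std::size_t count = 0;
    while (remaining > 0) {
        const std::size_t n = std::mbrlen(s, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2)) {
            return kInvalidLength;  // invalid or truncated sequence
        }
        if (n == 0) {
            break;  // terminating null character reached
        }
        s += n;
        remaining -= n;
        ++count;
    }
    return count;
}

// Case 3: wide string, one wchar_t is treated as one character.
std::size_t length_wide(const wchar_t *s) {
    assert(s != nullptr);
    return std::wcslen(s);
}
```

For plain ASCII input all three functions agree. They start to differ as soon as a character needs more than one byte: strlen then reports bytes, while the mbrlen loop reports characters.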

One issue that arises between cases 1 and 2 is that modern Linux often uses a locale which stores UTF-8 strings within the standard strings. As long as strings are only written, stored, and displayed, this is a great way to maintain compatibility and gain the modern feature of special character availability. However, as soon as you perform a more complex operation (like creating a substring), this approach results in unexpected behaviour, as the bytes of a single multi-byte character are treated like independent characters.

Example:

  • Your user is a geek and enters “あlptraum” as input string.
  • This string is stored in a std::string using the UTF-8 encoding of an English locale.
  • Your application now wants to extract the first character for some reason (e.g. to produce typographic capitalization using a specialized font).
  • The naive way of doing this is char first_char = s[0]; and std::string remaining = s.substr(1);
  • Because the Japanese “あ” occupies three bytes in UTF-8 (0xE3 0x81 0x82), this results in a lone lead byte 0xE3 plus a remainder that starts with the two orphaned bytes 0x81 0x82 in front of “lptraum”. Neither piece is valid text anymore.
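
A character-aware split has to look at the lead byte to find out how many bytes belong to the first character. The following is a minimal sketch under the assumption that the string really is UTF-8; the helper names are hypothetical and this is not how TheLib does it internally (TheLib goes through the multi-byte API of the locale). It also ignores combining characters, which may follow the first code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Number of bytes in the UTF-8 sequence that starts with 'lead'.
// Falls back to 1 for a stray continuation byte or an invalid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Splits 's' into its first UTF-8 encoded code point and the rest.
std::pair<std::string, std::string>
split_first_character(const std::string &s) {
    if (s.empty()) {
        return std::make_pair(std::string(), std::string());
    }
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(s[0]));
    if (n > s.size()) {
        n = s.size();  // truncated sequence: take what is there
    }
    assert(n >= 1 && n <= s.size());
    return std::make_pair(s.substr(0, n), s.substr(n));
}
```

Called with "\xE3\x81\x82lptraum" (the UTF-8 bytes of “あlptraum”), the first part holds all three bytes of “あ” and the second part is “lptraum”, whereas s.substr(1) cuts right through the character.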

This applies not only to Japanese characters, but to every character outside ASCII, including letters with diacritics. What is even more important: this issue also results in unexpected behaviour when using (or implementing) string operations which ignore case, e.g. comparisons.

Example of changing a string to lower case:

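// we will do this the STL-way:
// http://notfaq.wordpress.com/2007/08/04/cc-convert-string-to-upperlower-case/

#include <algorithm>
#include <ctype.h>   // declares ::tolower in the global namespace
#include <string>

int main() {
    std::string data = "\xE3\x81\x82lptraum";
    // 'data' now contains "あlptraum", encoded as UTF-8

    std::transform(data.begin(), data.end(), data.begin(), ::tolower);
    // well, tolower was called once per byte, so it never saw "あ" as one
    // character; a multi-byte letter with a lower-case form (e.g. "Ä") is
    // not converted, and negative char values are undefined behaviour here
    // ...
}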
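
One way around the byte-wise transform, using only the standard C library, is to convert the multi-byte string to wide characters, change the case there, and convert back. This is only a sketch under assumptions: the function name is hypothetical, the conversion follows the current LC_CTYPE locale (so the application has to call setlocale first for UTF-8 input to be understood), and calling mbstowcs and wcstombs with a null destination to query the required length is assumed to be supported, as it is on POSIX systems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <cwctype>
#include <string>
#include <vector>

// Lower-cases a multi-byte string character by character instead of
// byte by byte. Returns the input unchanged if it cannot be converted
// in the current locale.
std::string to_lower_multibyte(const std::string &s) {
    const std::size_t failed = static_cast<std::size_t>(-1);

    // Query the number of wide characters needed.
    const std::size_t n = std::mbstowcs(nullptr, s.c_str(), 0);
    if (n == failed) return s;

    std::vector<wchar_t> wide(n + 1);
    std::mbstowcs(wide.data(), s.c_str(), n + 1);

    // Now every element is one whole character.
    for (std::size_t i = 0; i < n; ++i) {
        wide[i] = static_cast<wchar_t>(
            std::towlower(static_cast<std::wint_t>(wide[i])));
    }

    // Query the number of bytes needed and convert back.
    const std::size_t m = std::wcstombs(nullptr, wide.data(), 0);
    if (m == failed) return s;

    std::string out(m, '\0');
    const std::size_t written = std::wcstombs(&out[0], wide.data(), m);
    assert(written == m);
    return out;
}
```

The sketch only covers one-to-one case mappings and stops at an embedded null character, which is usually acceptable for the simple applications discussed here.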

To avoid this problem, TheLib internally initializes the system locale for the application and detects whether the locale uses UTF-8 encoding. If it does, all TheLib string functions call the multi-byte API functions so that they work as expected. In addition, TheLib provides some functions to explicitly convert from or to UTF-8 strings (e.g. for file I/O).
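
To give an idea of what such an initialization could look like on a POSIX system, here is a minimal sketch. It is not TheLib's actual code: the function names are hypothetical, and it assumes that setlocale and the POSIX call nl_langinfo(CODESET) are available.

```cpp
#include <cassert>
#include <cctype>
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX: nl_langinfo, CODESET

// Returns true if a codeset name denotes UTF-8. Case, '-' and '_' are
// ignored, so "UTF-8", "utf8" and "Utf_8" are all accepted.
bool is_utf8_codeset(const std::string &name) {
    std::string normalized;
    for (char c : name) {
        if (c == '-' || c == '_') continue;
        normalized += static_cast<char>(
            std::tolower(static_cast<unsigned char>(c)));
    }
    return normalized == "utf8";
}

// Adopts the locale configured in the user's environment and reports
// whether its character encoding is UTF-8.
bool init_locale_and_detect_utf8() {
    if (std::setlocale(LC_ALL, "") == nullptr) {
        return false;  // the environment's locale could not be set
    }
    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != nullptr);
    return is_utf8_codeset(codeset);
}
```

An application would call init_locale_and_detect_utf8() once at start-up and then pick either the single-byte or the multi-byte code path for its string functions based on the result.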

Of course, you don’t need TheLib to do this. You can probably use another lib (I only know IBM’s Unicode library, ICU, which seems like a huge hulk), or you can use your own workarounds, or you can ignore such problems because “they will not occur in your application scenarios”. However, having TheLib do the job is just handy. Nothing more.
