tech-userlevel archive


Re: wide characters and i18n



On Fri, 16 Jul 2010 10:56:36 -0400
Ken Hornstein <kenh%pobox.com@localhost> wrote:

> But this brings up some possibly dumb questions: say I have a UTF8
> byte sequence I want to display on standard out; do I simply use
> printf("%s") like I have always been?  Do I have to do something
> different?  If so, what?
> 

That's the good thing about UTF-8: you can treat it as a sequence of
normal char objects, so printf("%s") works just as it always has. If your
terminal supports UTF-8, any sequence of non-ASCII characters should be
displayed correctly.
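For example, a minimal sketch (the string literal is just made-up UTF-8
bytes, and it assumes the terminal is set to UTF-8):

#include <stdio.h>

int
main(void)
{
    /* "café" encoded as UTF-8 bytes; printf() just passes them through */
    const char *s = "caf\xc3\xa9";

    printf("%s\n", s);
    return 0;
}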

> Sad Clouds suggested using wchar_t (and I am assuming functions like
> wprintf()) everywhere.  I see the functions to translate character
> strings into wchar_t ... but what do I use if I know that I have
> UTF-8?  And the reason I asked earlier about locale is that the
> locale affects the way the multibyte character routines behave, which
> makes me think that the locale setting affects the encoding all of
> those routines are using.

I use wchar_t when I need to know that each character is represented by
a fixed-size object. This way you can have a pointer to a string and
look at every character individually just by incrementing the pointer.
Sometimes I do it from left to right, but occasionally I may need to do
it from right to left. For example, if you have a filename:

some_long_file_name.txt

to quickly extract the suffix '.txt' you just scan the string from
right to left until you hit the '.' character. I think with UTF-8 this
type of string manipulation would be quite messy and you would have to
use a special library that understands UTF-8 encoding, etc.
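A minimal sketch of that kind of scan (wcsrchr() would do the same; the
filename is just the example above):

#include <stdio.h>
#include <wchar.h>

int
main(void)
{
    const wchar_t *name = L"some_long_file_name.txt";
    const wchar_t *p = name + wcslen(name);

    /* every element of the array is exactly one character, so walking
     * backwards is plain pointer arithmetic */
    while (p > name && *p != L'.')
        p--;
    if (*p == L'.')
        printf("suffix: %ls\n", p);     /* prints "suffix: .txt" */
    return 0;
}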

The multi-byte conversion functions are affected by the current locale.
Normally you would call

setlocale(LC_CTYPE, "");

at the start of your program, and you don't change the locale while the
program runs. Setting the empty locale makes the multi-byte conversion
functions query the user's locale environment variables and perform the
conversion based on them. So different users can use different locales,
which may result in different character encoding schemes; however, the
C library's wide character functions should handle that transparently.
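Something like this minimal sketch (the byte string is made-up UTF-8
data and assumes the user's locale is UTF-8):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int
main(void)
{
    const char *mb = "na\xc3\xafve";    /* "naïve" as UTF-8 bytes */
    wchar_t wc[32];

    /* empty locale: pick up LC_ALL/LC_CTYPE/LANG from the environment */
    setlocale(LC_CTYPE, "");

    /* mbstowcs() converts according to the current LC_CTYPE locale; in
     * a non-UTF-8 locale these particular bytes would be rejected */
    if (mbstowcs(wc, mb, 32) == (size_t)-1) {
        fprintf(stderr, "conversion failed\n");
        return 1;
    }
    printf("%zu characters\n", wcslen(wc));
    return 0;
}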

There are two problems with C wide characters:

1. Switching to a different locale while the program is running is not
thread-safe and may result in weird errors. This means you can only use
one locale for the lifetime of the program.

2. The interfaces for the C library's multi-byte to wide, and wide to
multi-byte, conversion functions are so badly designed it's not even
funny. The biggest problem with those functions is that they expect
NUL-terminated strings. If you have a partial (not NUL-terminated)
string in a buffer, you can't call a string conversion function on it,
because it won't stop until it finds a NUL and you end up with a buffer
overrun. You cannot "artificially" NUL-terminate the string either,
because after reading the NUL character the function resets the
mbstate_t object to its initial state. This will mess up the next
sequence of multi-byte characters if the encoding is stateful.
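For what it's worth, one workaround (a minimal sketch, not the wrapper
code mentioned below) is to drop down to mbrtowc(), which takes an
explicit byte count and keeps an incomplete trailing character pending
in the mbstate_t, so the next chunk of input can continue it:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert up to 'len' bytes from 'buf' (not necessarily NUL terminated)
 * into 'out', which has room for 'outlen' wide characters.  Returns the
 * number of wide characters produced, or (size_t)-1 on an invalid
 * sequence.  A trailing incomplete character is left pending in '*st'
 * so the caller can continue with the next chunk of input.
 */
static size_t
mb_chunk_to_wcs(wchar_t *out, size_t outlen, const char *buf, size_t len,
    mbstate_t *st)
{
    size_t nwc = 0;

    while (len > 0 && nwc < outlen) {
        size_t r = mbrtowc(&out[nwc], buf, len, st);

        if (r == (size_t)-1)    /* invalid multi-byte sequence */
            return (size_t)-1;
        if (r == (size_t)-2)    /* incomplete char, wait for more input */
            break;
        if (r == 0)             /* embedded NUL byte */
            r = 1;
        buf += r;
        len -= r;
        nwc++;
    }
    return nwc;
}

int
main(void)
{
    /* "hé" split across two chunks, cutting the UTF-8 'é' in half */
    const char part1[] = { 'h', (char)0xc3 };
    const char part2[] = { (char)0xa9 };
    wchar_t out[8];
    mbstate_t st;
    size_t total = 0;

    setlocale(LC_CTYPE, "");    /* assumes a UTF-8 locale */
    memset(&st, 0, sizeof(st));

    total += mb_chunk_to_wcs(out + total, 8 - total, part1, sizeof(part1), &st);
    total += mb_chunk_to_wcs(out + total, 8 - total, part2, sizeof(part2), &st);

    printf("converted %zu wide characters\n", total);  /* 2 in a UTF-8 locale */
    return 0;
}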

I spent two days jumping through hoops trying to figure out how to
convert partial strings. I think I nailed it in the end with a 30%
performance penalty, but still 3.5 times faster than iconv().

If anyone is interested, I can post the code for the wrapper
functions...

