Re: wide characters and i18n
On Fri, 16 Jul 2010 10:56:36 -0400
Ken Hornstein <kenh%pobox.com@localhost> wrote:
> But this brings up some possibly dumb questions: say I have a UTF8
> byte sequence I want to display on standard out; do I simply use
> printf("%s") like I have always been? Do I have to do something
> different? If so, what?
That's the good thing about UTF-8: you can treat it as a sequence of
normal char objects, so printf("%s") keeps working. If your terminal
supports UTF-8, then any valid sequence of non-ASCII characters should
be displayed correctly.
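A minimal sketch of that point, assuming the input is valid UTF-8 and
the terminal expects UTF-8. The helper names print_utf8 and
utf8_code_points are made up for illustration; the second one only
shows that bytes and characters are counted differently.

```c
#include <stddef.h>
#include <stdio.h>

/* Write a UTF-8 string like any other char string: the bytes are
 * passed through untouched. Returns whatever fprintf returns. */
int print_utf8(FILE *out, const char *s)
{
    return fprintf(out, "%s\n", s);
}

/* Count code points by skipping continuation bytes (10xxxxxx).
 * Assumes s is valid UTF-8; no validation is done. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "na\xc3\xafve" strlen() reports 6 bytes, while
utf8_code_points() reports 5 characters; print_utf8() needs no special
handling either way.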
> Sad Clouds suggested using wchar_t (and I am assuming functions like
> wprintf()) everywhere. I see the functions to translate character
> strings into wchar_t ... but what do I use if I know that I have
> UTF-8? And the reason I asked earlier about locale is that the
> locale affects the way the multibyte character routines behave, which
> makes me think that the locale setting affects the encoding all of
> those routines are using.
I use wchar_t when I need to know that each character is represented by
a fixed size object. This way you can have a pointer to a string and
look at every character individually just by incrementing the pointer.
Sometimes I do it from left to right, but occasionally I may need to do
it from right to left. For example, to quickly extract the suffix '.txt'
from a filename, you just scan the string from right to left until you
hit the '.' character. I think with UTF-8 this type of string
manipulation would be quite messy and you would have to use a special
library that understands UTF-8 encodings, etc.
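A rough sketch of the right-to-left scan on a wchar_t string. The
function name find_suffix and the sample filenames in the note below
are placeholders, not taken from any real program.

```c
#include <stddef.h>
#include <wchar.h>

/* Return a pointer to the last '.' in name, or NULL if there is none.
 * Every element is one fixed-size wchar_t, so stepping the pointer
 * backwards always lands on a whole character. */
const wchar_t *find_suffix(const wchar_t *name)
{
    const wchar_t *p = name + wcslen(name);

    while (p > name) {
        p--;
        if (*p == L'.')
            return p;
    }
    return NULL;
}
```

With L"report.txt" this returns a pointer to L".txt", and with
L"README" it returns NULL. The standard wcsrchr(name, L'.') performs
the same search in one call; the loop is spelled out only to show the
pointer stepping.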
The multi-byte conversion functions are affected by the current locale.
Normally you would call setlocale() with an empty locale string once
at the start of your program, and you don't change the locale again
while the program runs. Setting an empty locale makes the multi-byte
conversion functions query the user's locale environment variables and
perform conversion based on that. So different users can use different
locales, which may result in different character encoding schemes, but
the C library wide character functions should handle that
transparently.
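As an illustration, here is a sketch of that startup call followed by a
locale-dependent conversion. Using LC_ALL as the category is an
assumption (LC_CTYPE alone would be enough for the conversion
routines), and the names init_locale and mb_to_wide are hypothetical.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Adopt the user's locale once, at program start.
 * Returns 0 on success, -1 if the environment names an unknown locale. */
int init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL ? 0 : -1;
}

/* Convert a null-terminated multi-byte string to a newly allocated
 * wide string, using the encoding of the current locale.
 * Returns NULL on an invalid sequence or allocation failure.
 * The caller frees the result. */
wchar_t *mb_to_wide(const char *s)
{
    mbstate_t st;
    const char *src = s;
    size_t len;
    wchar_t *out;

    memset(&st, 0, sizeof st);
    len = mbsrtowcs(NULL, &src, 0, &st);    /* measure only */
    if (len == (size_t)-1)
        return NULL;

    out = malloc((len + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    memset(&st, 0, sizeof st);
    src = s;
    mbsrtowcs(out, &src, len + 1, &st);     /* stores the terminator too */
    return out;
}
```

A program would call init_locale() once at the top of main() and then
use mb_to_wide() on whatever the user's encoding happens to be.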
There are two problems with C wide characters:
1. Switching to different locales while the program is running is not
thread-safe and may result in weird errors. This means you can only use
one locale during program run time.
2. The interfaces for the C library multi-byte to wide, and wide to
multi-byte, conversion functions are so badly designed, it's not even
funny. The biggest problem with those functions is that they expect
null-terminated strings. If you have a partial (not null-terminated)
string in the buffer, you can't call a string conversion function on
it, because it won't stop until it finds a null character and you end
up with a buffer overrun. You cannot "artificially" null-terminate the
string, because after reading the null character, the function will
reset the mbstate_t object to the initial state. This will mess up the
next sequence of multi-byte characters if the encoding has state.
I spent two days jumping through hoops, trying to figure out how to
convert partial strings. I think I nailed it in the end, with a 30%
performance penalty, but still 3.5 times faster than iconv().
If anyone is interested, I can post the code for the wrapper