tech-userlevel archive


Re: wide characters and i18n



Thanks to everyone for the answers to my dumb questions; it does help fill
in the gaps a lot.

I have a few follow-up questions; if anyone knows the answers, that
would be appreciated.

Okay, I now know what Unicode is.  A follow-up question ... it seems
that Unicode is developed in tandem with ISO/IEC 10646; do people mostly
consider those the same, or are there differences that affect
implementation details?

As people have explained, UTF-8 is a Unicode encoding that is a) a sequence
of bytes, b) doesn't use NUL ('\0'), c) a superset of ASCII, so plain ASCII
continues to work, and d) one in which a single character may take more
than one byte.

Obviously the last one is the one that presents a number of challenges.
Okay, fine.  In what I would call my "normal" applications I don't
really do that much on a character-by-character basis; I generally deal
with whole strings.  From what people are saying, if I treat strings
as arrays of char and as opaque objects, then I can simply say everything
is UTF-8 and most stuff should work fine, right?  Obviously I'll have to
know somehow whether something I get from a file or the network is UTF-8
and do the right thing.

But this brings up some possibly dumb questions: say I have a UTF-8 byte
sequence I want to display on standard output; do I simply use printf("%s")
as I always have?  Do I have to do something different?  If so, what?

Sad Clouds suggested using wchar_t (and, I am assuming, functions like
wprintf()) everywhere.  I see the functions to translate multibyte strings
into wchar_t ... but what do I use if I know that I have UTF-8?  And
the reason I asked earlier about locale is that the locale affects the
way the multibyte character routines behave, which makes me think that
the locale setting determines the encoding all of those routines use.

--Ken

