tech-userlevel archive


Re: wide characters and i18n



> Okay, I now know what Unicode is.  A followup question ... it seems
> that Unicode is developed in tandem as ISO-10646; do people mostly
> consider those the same, or are there differences that affect
> implementation details?

Personally?  I consider them the same.

But I know that is a simplification.  For my purposes, so far, it's
been an ignorable simplification.  If I were doing something
sufficiently serious, I would make sure I took the time to look into
whether it remained ignorable.

> As people have explained, UTF-8 is a Unicode encoding that is a) a
> sequence of bytes,

Right.

> b) doesn't use NUL ('\0'),

Wrong.  It uses a 0x00 octet (which is what I assume you're talking
about) to represent U+0000.  It does not use a 0x00 octet under any
other conditions, though.
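
To make that concrete, here is a rough sketch of a UTF-8 encoder in C.
The names utf8_encode and utf8_encoding_has_zero_octet are made up for
this illustration; they are not library functions.  The point is that
every octet of a multi-octet sequence has its high bit set, so the only
codepoint whose encoding can contain 0x00 is U+0000 itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not a standard library function.
 * Encodes one code point as UTF-8 into out[] and returns the number of
 * octets written, or 0 for values that are not encodable (surrogates
 * and anything above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 0xxxxxxx; U+0000 becomes 0x00 */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if the UTF-8 encoding of cp contains a 0x00 octet, else 0. */
int utf8_encoding_has_zero_octet(uint32_t cp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    for (size_t i = 0; i < n; i++)
        if (buf[i] == 0x00)
            return 1;
    return 0;
}
```

Looping utf8_encoding_has_zero_octet over U+0001 through U+10FFFF
should never return 1, which is why NUL-terminated C strings can carry
any UTF-8 text that does not itself contain U+0000.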

> c) is a superset of ASCII so ASCII continues to work,

Mostly.  It's more like "a sequence of Unicode characters all in the
ASCII range is represented in UTF-8 as the same string of octets as the
same string represented in the usual `just store ASCII in octets'
convention".

That is, given any string in ASCII stored one character per octet with
the high bits set to 0 (the usual convention for storing ASCII strings
in octet strings), the same sequence of octets is valid UTF-8 for the
Unicode codepoint string for the same characters.
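
A minimal sketch of the check this implies, in C (is_ascii_octets is a
name invented here, not a library function): if every octet in a buffer
has its high bit clear, the buffer is ASCII in the usual convention and
can be handed to UTF-8-expecting code unchanged.

```c
#include <stddef.h>

/* Hypothetical helper for illustration.  Returns 1 if every octet in
 * s[0..n-1] has its high bit clear, i.e. the buffer is plain ASCII and
 * is therefore also the UTF-8 encoding of the same characters.
 * Returns 0 as soon as an octet with the high bit set is found. */
int is_ascii_octets(const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] & 0x80)
            return 0;
    return 1;
}
```

Note that a 0 result does not mean the buffer is invalid UTF-8; it only
means the buffer is not pure ASCII, so the octets need real decoding.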

> and d) characters may take more than one byte.

Right.

> Obviously the last one is the one that presents a number of
> challenges.

Well, it's one of them.  "Characters do not all take the same number of
octets" is another property UTF-8 has which can cause trouble (though,
like your (d), it's implied by other properties put together).
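
One place this bites is length: the octet count and the codepoint count
of a string stop being the same number.  A small sketch in C, assuming
the input is valid, NUL-terminated UTF-8 (utf8_count_codepoints is a
made-up name, and it does no validation):

```c
#include <stddef.h>

/* Hypothetical helper for illustration.  Counts code points in a
 * NUL-terminated string that is ASSUMED to be valid UTF-8, by counting
 * every octet that is not a continuation octet (10xxxxxx). */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For "caf\xC3\xA9" (the word "café"), strlen reports 5 octets while this
function reports 4 codepoints.  Codepoints are still not the same thing
as what a user would call characters, but that is a separate problem.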

> From what people are saying, if I treat characters as arrays of char
> and as opaque objects, then I can simply say everything is UTF-8 and
> most stuff should work fine, right?

If everything actually _is_ UTF-8, and you don't need to do any
particular processing, then yes, you can just treat strings as
content-opaque octet sequences.  But that's true of pretty much any
encoding.

> But this brings up some possibly dumb questions: say I have a UTF8
> byte sequence I want to display on standard out; do I simply use
> printf("%s") like I have always been?  Do I have to do something
> different?  If so, what?

"That depends".  It depends on whether printf tries to be smart (most
printfs I'm familiar with treat strings as opaque octet sequences for
things like %s, but I'd be surprised if there weren't some that went to
the trouble to process characters rather than octets).  It depends on
how the octet sequence produced by your program is interpreted
(terminal or terminal emulator handling UTF-8 or 8859-1 or what).  It
depends on what exactly you mean by "display on standard out", too.
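
For the octet-oriented kind of printf, a sketch of what "opaque" means
in practice (format_prefix is a wrapper invented for this example; it
assumes a printf family where a %s precision counts octets, which is
what the C standard describes for plain %s):

```c
#include <stdio.h>

/* Hypothetical wrapper for illustration.  Formats at most max_octets
 * octets of s into dst using "%.*s" and returns snprintf's result.
 * Assumes a byte-oriented %s: the precision limits octets, not
 * characters, so it can cut a multi-octet sequence in half. */
int format_prefix(char *dst, size_t dstsize, const char *s, int max_octets)
{
    return snprintf(dst, dstsize, "%.*s", max_octets, s);
}
```

With s = "caf\xC3\xA9" and a generous precision, the five octets come
through untouched, and whether they show up as "café" is then entirely
up to whatever is reading the output.  With a precision of 4, the
output ends in a lone 0xC3 octet, which is not valid UTF-8 on its own.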

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

