Re: wide characters and i18n

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: wide characters and i18n
From: der Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Fri, 16 Jul 2010 14:55:27 -0400 (EDT)

>>> b) [UTF-8] doesn't use NUL ('\0'),
>> Wrong.  It uses a 0x00 octet (which is what I assume you're talking
>> about) to represent U+0000.  It does not use a 0x00 octet under any
>> other conditions, though.
> Okay ... I'm not up on nomenclature.  U+0000 means .... a particular
> Unicode codepoint?

Right.  U+ followed by four (or more) hex digits refers to a Unicode
codepoint (or, if you're in a context where blurring the distinction
between codepoints and their associated characters is appropriate,
sometimes to the character that goes with that codepoint).

> I guess that's a Unicode NULL, according to what I've seen online.

Right.

> I guess the real question is ... I'm used to C-style strings, where I
> don't have to care about the length, but 0x00 is the terminator.  Can
> I still do that with Unicode?

You can with UTF-8; not with some other octetizations (to coin a word)
of Unicode character strings.  In fact, you can to approximately the
same degree you can with ASCII: occasionally people want to process
ASCII text that can include NULs and have to avoid C library routines
as a result, and much the same is true here.
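
A minimal sketch of both cases, for illustration only. The function names
are made up for this example. The first function is the ordinary C-string
view, which works on UTF-8 because a 0x00 octet only ever encodes U+0000.
The second is the carry-the-length view needed when the text may itself
contain U+0000.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length as the C library sees it: octets before the first 0x00.
 * For UTF-8 this counts octets, not characters. */
size_t cstring_octets(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Count 0x00 octets in a buffer whose length is carried separately.
 * This is the style needed for text that may contain U+0000 (or, in
 * ASCII terms, NUL), where strlen() and friends would stop early. */
size_t count_zero_octets(const char *buf, size_t len)
{
    size_t n = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end && (p = memchr(p, 0, (size_t)(end - p))) != NULL) {
        n++;
        p++;    /* step past the zero octet just found */
    }
    return n;
}
```

With the UTF-8 string "na\xc3\xaf" "ve" (five characters, the third being
U+00EF), cstring_octets() reports 6: the two-octet sequence contains no
0x00, so the terminator convention still holds.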

> I mean, I see that U+0000 is a valid Unicode code point, but it's not
> actually anything PRINTABLE, right?

Right.

> Sure, I should be passing around lengths to everything, but I'm just
> thinking of the amount of code that would need to be changed.

Well..."should"?  Only in the sense that you "should" be doing the same
for ASCII text.

>>> [to print a UTF-8-encoded Unicode string] do I simply use
>>> printf("%s") like I have always been?
>> "That depends".
> I'm just thinking of the basic example of, "I want my command-line
> program to print out something to the defined Unix standard output",
> which is what most of them do.

The really really short answer is "yes, do that".  It's probably the
closest available approximation to what you want, and will work in most
cases where anything will.
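
As a rough illustration of why this works: %s copies octets through
unchanged, so a UTF-8 string comes out the other side as the same UTF-8
octets. The sketch below uses snprintf() so the result can be inspected;
printf() treats %s the same way. The function names are invented for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format a greeting into out[]. The name is passed through %s octet for
 * octet; nothing here knows or cares that it is UTF-8. Returns what
 * snprintf() returns: the number of octets the full result needs,
 * not counting the terminating 0x00. */
int format_greeting(char *out, size_t outsz, const char *name)
{
    assert(out != NULL && name != NULL);
    return snprintf(out, outsz, "hello, %s\n", name);
}

/* The same thing sent to standard output. */
void print_greeting(const char *name)
{
    assert(name != NULL);
    printf("hello, %s\n", name);
}
```

Whether the octets then display as intended is up to whatever is reading
standard output.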

> From what people are saying ... there's not really a way of telling,
> today, if your terminal supports UTF-8, or 8859-1, or anything else
> (unless it's embedded in locale information, somehow).

Right. :(  Some locale systems actually include charset info as well.
However, even there, making sure that the setting and the reality match
is usually pushed off to the human layer; I could, for example, set
environment variables to claim UTF-8 support in a terminal emulator
doing 8859-7, and software would be almost entirely unable to tell that
it's being lied to - but what I-the-human is seeing would not be what
the software expects based on what it's been told.
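
On systems whose locales do carry charset info, a program can at least
ask what it has been told. A sketch using the POSIX nl_langinfo(CODESET)
call follows; the two function names are made up, and comparing against
the exact spelling "UTF-8" is an assumption, since the codeset name string
is implementation-defined.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Return the codeset name the environment claims (via LANG, LC_CTYPE,
 * LC_ALL).  This reports the claim only; nothing here can check it
 * against what the terminal actually does with the octets. */
const char *claimed_codeset(void)
{
    setlocale(LC_CTYPE, "");        /* adopt the environment's locale */
    return nl_langinfo(CODESET);
}

/* Assumption: the implementation spells the name exactly "UTF-8". */
int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0;
}
```

A terminal emulator doing 8859-7 under a locale that claims UTF-8 would
pass this check all the same.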

>> The main reason for it is that UTF-8 wastes half of bandwidth on
>> wire,

This is true only if you normally use a non-ASCII set of characters
that have an 8-bit character set, and you're comparing to such a set.
(Examples might be 8859-7 and KOI-8.  Not 8859-1, because most 8859-1
users draw on the ASCII low half heavily.)

> This brings up a couple of questions:

> - Isn't UTF-8 already ASCII compatible?

For suitable values of "compatible".  It is ASCII compatible in that
taking a string encoded in ASCII using the usual "zero-pad each
character to one octet" convention, converting it (conceptually) to
ASCII characters, mapping them to their Unicode equivalents, and then
encoding the resulting Unicode codepoint string in UTF-8 results in the
same octet sequence you started with.  (Most of those conversion steps
do not actually involve any data massaging, but rather just conceptual
reframes.)
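
To make that concrete, here is a small sketch of a UTF-8 encoder for one
codepoint, plus a check that ASCII text passes through it unchanged. The
function names are invented for the example; the bit layout is the
standard UTF-8 one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-8 into out[].  Returns the number of
 * octets written, or 0 for values that are not encodable (surrogates
 * and anything above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* ASCII range: one octet */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if every octet of the string is ASCII and encodes to
 * exactly itself in UTF-8, 0 otherwise. */
int ascii_roundtrips(const char *ascii)
{
    unsigned char enc[4];

    assert(ascii != NULL);
    for (; *ascii != '\0'; ascii++) {
        unsigned char c = (unsigned char)*ascii;
        if (c >= 0x80 || utf8_encode(c, enc) != 1 || enc[0] != c)
            return 0;
    }
    return 1;
}
```

The first branch of utf8_encode() is the whole of the compatibility
claim: codepoints below U+0080 come out as the single octet they already
were, and U+0000 comes out as the lone 0x00 octet.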

> - How does UTF-8 waste half of the bandwidth?

See above.  If you use, say, KOI-8 (Cyrillic, which strikes me as
likely what Aleksej is using), or 8859-7 (Greek), or 8859-8 (Hebrew)
and are using mostly the non-ASCII half, then UTF-8 encoding results in
two octets per character on the wire for most characters, as opposed to
using KOI-8 (or whatever), which uses one octet per character.
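
A back-of-the-envelope version of that comparison, as a sketch. It takes
text in some 8-bit charset and estimates its size in UTF-8, under the
simplifying assumption that every octet in the high half is a letter that
needs exactly two UTF-8 octets. That holds for the Cyrillic, Greek and
Hebrew letters themselves but not for everything in those charsets (some
punctuation and box-drawing characters need three). The function name is
made up.

```c
#include <assert.h>
#include <stddef.h>

/* Estimate the UTF-8 size, in octets, of text held in an 8-bit charset
 * such as KOI-8 or 8859-7.
 * Assumption: each octet >= 0x80 maps to a codepoint in U+0080..U+07FF
 * and therefore takes two octets in UTF-8; ASCII octets take one. */
size_t utf8_size_estimate(const unsigned char *text, size_t len)
{
    size_t octets = 0;

    assert(text != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        octets += (text[i] < 0x80) ? 1 : 2;
    return octets;
}
```

Six octets of all-Cyrillic text come out as an estimated twelve, which is
where the factor of two comes from; text that is mostly ASCII barely
grows at all.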

Sometimes this is important; sometimes it's not.  Aleksej has a good
point in that FFS (which is probably what most NetBSD systems use) has
a limit of 255 on directory entry name length - but that's 255 octets,
not 255 characters.  If you have a tendency to use file names in the
100-200 character range, this may well matter to you.  There doubtless
are plenty of other relatively small limits which look smaller when
viewed through UTF-8 glasses....
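
A sketch of what that looks like in code, assuming valid UTF-8 input. The
constant and function names are invented for the example, and 255 is
simply the figure quoted above; real code would take the limit from the
system rather than hard-coding it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name for the 255-octet limit mentioned above. */
#define NAME_LIMIT_OCTETS 255

/* Count characters (codepoints) in a UTF-8 string by counting every
 * octet that is not a continuation octet (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* The limit is on octets, so it is strlen() that matters here,
 * not the character count. */
int fits_name_limit(const char *name)
{
    assert(name != NULL);
    return strlen(name) <= NAME_LIMIT_OCTETS;
}
```

A 150-character name made of two-octet characters is 300 octets long, so
utf8_codepoints() reports 150 while fits_name_limit() reports failure.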

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

Follow-Ups:
- Re: wide characters and i18n
  - From: Aleksej Saushev

References:
- Re: wide characters and i18n
  - From: Ken Hornstein
