tech-userlevel archive


Re: wide characters and i18n



See, I think I am understanding things, but then ....

>> b) doesn't use NUL ('\0'),
>
>Wrong.  It uses a 0x00 octet (which is what I assume you're talking
>about) to represent U+0000.  It does not use a 0x00 octet under any
>other conditions, though.

Okay ... I'm not up on the nomenclature.  U+0000 means ... a particular
Unicode code point?  I guess that's the Unicode NULL character, according
to what I've seen online.

I guess the real question is ... I'm used to C-style strings, where I
don't have to care about the length, but 0x00 is the terminator.  Can
I still do that with Unicode?  I mean, I see that U+0000 is a valid
Unicode code point, but it's not actually anything PRINTABLE, right?
Sure, I should be passing around lengths to everything, but I'm just
thinking of the amount of code that would need to be changed.
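To make the terminator question concrete, here is a minimal sketch, assuming well-formed UTF-8 input. Since a 0x00 octet only ever encodes U+0000, strlen() still finds the end of the string; it just counts octets rather than characters. The helper name utf8_codepoints is made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count code points in a NUL-terminated UTF-8 string.
 * Assumption: the input is well-formed UTF-8 (no validation is done).
 * Hypothetical helper, not part of any standard library.
 */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Continuation octets have the form 10xxxxxx; skip them. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "h\xc3\xa9llo" (an e with acute accent encoded as two octets), strlen() reports 6 octets while utf8_codepoints() reports 5 code points. Code that only copies, compares, or concatenates NUL-terminated strings should not need to change; code that assumes one octet per character does.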

>> But this brings up some possibly dumb questions: say I have a UTF8
>> byte sequence I want to display on standard out; do I simply use
>> printf("%s") like I have always been?  Do I have to do something
>> different?  If so, what?
>
>"That depends".  It depends on whether printf tries to be smart (most
>printfs I'm familiar with treat strings as opaque octet sequences for
>things like %s, but I'd be surprised if there weren't some that went to
>the trouble to process characters rather than octets).  It depends on
>how the octet sequence produced by your program is interpreted
>(terminal or terminal emulator handling UTF-8 or 8859-1 or what).  It
>depends on what exactly you mean by "display on standard out", too.

I'm just thinking of the basic example of, "I want my command-line program
to print out something to the defined Unix standard output", which is what
most of them do.  From what people are saying ... there's not really a way
of telling, today, if your terminal supports UTF-8, or 8859-1, or anything
else (unless it's embedded in locale information, somehow).
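As far as I can tell, the locale route would look something like the sketch below. It asks the C library which codeset the user's locale names, using the POSIX calls setlocale() and nl_langinfo(CODESET). This only reports what the environment (LANG, LC_ALL, LC_CTYPE) claims, not what the terminal really does, and the exact codeset strings vary between systems. The names current_codeset and codeset_is_utf8 are made up for this example.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Return the codeset name of the user's locale, e.g. "UTF-8".
 * Calls setlocale(LC_CTYPE, "") so the environment is honoured.
 * Hypothetical helper for illustration.
 */
const char *current_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/*
 * Assumption: the system spells the codeset exactly "UTF-8".
 * Other spellings would need extra cases here.
 */
int codeset_is_utf8(const char *codeset)
{
    return strcmp(codeset, "UTF-8") == 0;
}
```

A program could call codeset_is_utf8(current_codeset()) once at startup and, if it holds, pass UTF-8 octets straight to printf("%s") as before; otherwise it would have to convert or fall back to ASCII.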

Also, Aleksej says:

>Sorry, this is wrong. This assumes that you don't use anything ASCII
>compatible (more or less). I do, and "UTF-8 by default" will cause major
>pain to me and to many users here.
>
>The main reason for it is that UTF-8 wastes half of bandwidth on wire,
>and some of NetBSD tools don't tolerate long file names. E.g. pax.
>I meet border cases already, and UTF-8 by default will double on-wire
>length of file names in consideration.

This brings up a couple of questions:

- Isn't UTF-8 already ASCII compatible?
- How does UTF-8 waste half of the bandwidth?
- What would you prefer we do instead?
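For reference, here is how many octets UTF-8 spends per code point, as a small sketch. My guess (and it is only a guess about what Aleksej means) is that the comparison is against a single-octet encoding such as KOI8-R, where a Cyrillic letter takes one octet instead of two. The function name utf8_len is made up.

```c
#include <stddef.h>

/*
 * Number of octets UTF-8 needs for one code point; 0 if the value is
 * beyond U+10FFFF.  Surrogates (U+D800..U+DFFF) are not rejected in
 * this sketch.  Hypothetical helper for illustration.
 */
size_t utf8_len(unsigned long cp)
{
    if (cp < 0x80)
        return 1;   /* ASCII range: same single octet as ASCII */
    if (cp < 0x800)
        return 2;   /* includes Latin-1 supplement, Greek, Cyrillic */
    if (cp < 0x10000)
        return 3;   /* rest of the Basic Multilingual Plane */
    if (cp <= 0x10FFFFUL)
        return 4;   /* supplementary planes */
    return 0;
}
```

So utf8_len('A') is 1 and an ASCII-only file name costs nothing extra, while utf8_len(0x0416) (CYRILLIC CAPITAL LETTER ZHE) is 2, which would double the length of an all-Cyrillic name relative to a one-octet-per-letter encoding.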

--Ken

