tech-userlevel archive


Re: wide characters and i18n



> Hi, I'm trying to understand how to write portable C code that
> supports international character sets.

The very first thing you need to do is determine just what "supports"
means for you here.

Personally, the biggest problems I've run into have been due to the
mismatch between octet strings and character strings.  There are a lot
of places where I as a coder get octet strings but humans think of them
as character strings, and the mismatch can be problematic.  File names
are an example: most Unixish filesystems actually name files with octet
strings, not character strings; for example, a file name consisting of
a single lowercase beta generated by a user using 8859-7 is
indistinguishable from a file name consisting of a single lowercase
a-circumflex generated by a user using 8859-1, and from a file name
consisting of a 0xe2 octet generated by an application that uses
file names to store binary data that does not represent characters at
all: the file name is not a character sequence but an octet sequence
(which may or may not be an encoded character sequence).
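
To make the 0xe2 example concrete, here is a minimal sketch in C.  The
helper names (decode_8859_1, decode_8859_7_partial, encode_utf8) are
made up for illustration, and the 8859-7 table is deliberately partial:
it covers only ASCII and the 0xdc-0xfe range, which is all this example
needs.  The point is only that the same stored octet yields different
characters, and thus different UTF-8, depending on an encoding the
filesystem never recorded.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: in ISO 8859-1 every octet maps to the Unicode
 * code point with the same numeric value. */
uint32_t decode_8859_1(unsigned char octet)
{
    return octet;
}

/* Hypothetical helper: a deliberately partial ISO 8859-7 decoder.
 * It handles ASCII and the 0xDC..0xFE range (U+03AC..U+03CE); anything
 * else is reported as U+FFFD because this sketch does not cover it. */
uint32_t decode_8859_7_partial(unsigned char octet)
{
    if (octet < 0x80)
        return octet;
    if (octet >= 0xDC && octet <= 0xFE)
        return (uint32_t)octet + 0x02D0;
    return 0xFFFD;
}

/* Encode one code point as UTF-8 into out[]; returns the octet count.
 * Assumes cp is a valid scalar value (no range or surrogate checks). */
size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Feeding 0xe2 through decode_8859_1 gives U+00E2 (a-circumflex, UTF-8
c3 a2); feeding it through decode_8859_7_partial gives U+03B2 (beta,
UTF-8 ce b2).  Nothing in the file name itself says which is right.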

As an example of the sort of problem this confusion between octet
strings and character strings engenders, the ssh spec is, strictly,
unimplementable on NetBSD (and probably other Unix variants), because
things like user names and passwords in the OS are octet strings,
whereas the protocol specifies that they are character strings encoded
in UTF-8.  This means that, for example, it is impossible for the
implementation to tell whether a given username on the wire should
match a username in the user database, because there is no way to tell
what encoding the stored octet string was generated using.
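
About the most an implementation can do with a stored octet string is
check whether it is well-formed UTF-8.  The sketch below does that; the
function name is made up, and this is not what any particular ssh
implementation does.  Note that the check is necessary but not
sufficient: the 8859-1 string "A-circumflex, copyright sign" is the
octets c2 a9, which is also perfectly good UTF-8 for a lone copyright
sign, so passing the check does not prove the string was generated as
UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if s[0..len) is well-formed UTF-8.
 * Rejects truncated sequences, stray continuation octets, overlong
 * forms, surrogates, and values above U+10FFFF. */
bool is_well_formed_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (b < 0x80) {                 /* plain ASCII octet */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;               /* not a valid lead octet */
        }

        if (len - i - 1 < need)
            return false;               /* sequence is truncated */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;           /* not a continuation octet */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;               /* overlong, too big, surrogate */

        i += need + 1;
    }
    return true;
}
```

A lone 0xe2 octet fails this check (it announces a three-octet sequence
that never arrives), so such a stored name can at least be ruled out as
UTF-8; a name that passes remains ambiguous.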

> As I understand so far, it has a lot to do with C library and current
> locale setting.

Depending on what you want to do, it might.

> 1. What is the recommended way for user level applications to deal
> with different character encodings?

I doubt there is any single "recommended way" - or, at least, if there
is, the recommendation should be ignored because it came from someone
who either has an axe to grind or hasn't thought about the issues.  The
most sensible way to deal with different encodings depends on what you
need to do.  For some purposes, for example, all input data is tagged
with a character set (perhaps implicitly) and it's enough to just make
sure you preserve that marking through whatever processing you do.  For
other purposes, it is necessary to recode, but nothing more.  For yet
others, what you outline (convert everything to some wider-than-8-bit
type for internal use) is a right answer.
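
For the last of those cases, a minimal sketch using the standard C
library might look like the following.  The function name to_wide is
made up.  It assumes the input is encoded according to the current
LC_CTYPE locale, so the caller would normally have done
setlocale(LC_CTYPE, "") first; without that, the "C" locale is in
effect and only its character set is guaranteed to convert.  Also keep
in mind that the width and encoding of wchar_t are
implementation-defined.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string, in
 * the encoding of the current LC_CTYPE locale, to a freshly allocated
 * wide string.  Returns NULL if the input is not valid in that
 * encoding or if allocation fails.  The caller frees the result. */
wchar_t *to_wide(const char *s)
{
    mbstate_t st;
    const char *p = s;
    wchar_t *w;
    size_t n;

    /* First pass: a null destination only counts the wide characters. */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second pass: convert for real, including the terminator. */
    p = s;
    memset(&st, 0, sizeof st);
    if (mbsrtowcs(w, &p, n + 1, &st) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}
```

Once the data is in wide form, the wcs* and isw* functions operate on
characters rather than octets.  Whether that is worth the conversion
depends, again, on what you need to do with the data.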

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

