tech-userlevel archive

Re: wide characters and i18n

On Sat, 10 Jul 2010 23:01:10 -0400 (EDT)
der Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:

> > Hi, I'm trying to understand how to write portable C code that
> > supports international character sets.
> The very first thing you need to do is determine just what "supports"
> means for you here.
> Personally, the biggest problems I've run into have been due to the
> mismatch between octet strings and character strings.  There are a lot
> of places where I as a coder get octet strings but humans think of
> them as character strings, and the mismatch can be problematic.  File
> names are an example: most Unixish filesystems actually name files
> with octet strings, not character strings; for example, a file name
> consisting of a single lowercase beta generated by a user using
> 8859-7 is indistinguishable from a file name consisting of a single
> lowercase a-circumflex generated by a user using 8859-1, and from a
> filename consisting of a 0xe2 octet generated by an application that
> uses filenames to store binary data that does not represent
> characters at all: the file name is not a character sequence but an
> octet sequence (which may or may not be an encoded character
> sequence).

I guess this can be a problem if the user has one locale setting,
UTF-8 for example, but the filenames were created under several
different encodings. If you want to do something like regular
expression matching, you would call mbsrtowcs() to convert the
multibyte filename string to a fixed-width wide character string.
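
A minimal sketch of that conversion, assuming the filename is a
NUL-terminated string and the caller has already selected a locale
with setlocale(). The helper name to_wide() is made up for
illustration. It returns NULL as soon as mbsrtowcs() rejects the
input, which is exactly the situation asked about below:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * to_wide() is a hypothetical helper, not a library function.
 * Converts src using the current LC_CTYPE locale.  Returns a
 * malloc()ed wide string the caller must free(), or NULL if src is
 * not valid in that locale or memory runs out.
 */
wchar_t *to_wide(const char *src)
{
    mbstate_t st;
    const char *p;
    size_t n;
    wchar_t *dst;

    assert(src != NULL);

    /* First pass: a NULL destination only counts wide characters. */
    p = src;
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;            /* invalid sequence, errno is EILSEQ */

    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Second pass: convert for real, including the terminator. */
    p = src;
    memset(&st, 0, sizeof st);
    mbsrtowcs(dst, &p, n + 1, &st);
    return dst;
}
```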

What I'm trying to figure out is this: if the filename encoding does
not match the user's locale setting, mbsrtowcs() can stop on a byte
sequence it does not consider legal. How do you skip it? The bad
sequence could be 2 or 4 bytes long, but how do you know for sure? Do
you just keep calling mbsrtowcs() in 1-byte increments until it
manages to decode the next character sequence?
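
To make the 1-byte-increment idea concrete, here is a rough sketch of
one way it could look. It uses mbrtowc(), which decodes one character
at a time, rather than mbsrtowcs(). The helper name decode_lossy()
and the choice of L'?' as a replacement are my own assumptions, and
which bytes count as invalid depends entirely on the current locale:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * decode_lossy() is a hypothetical helper, not a library function.
 * Decodes src in the current LC_CTYPE locale into dst (room for
 * dstlen wide characters, terminator included).  Every byte that does
 * not start a valid character becomes L'?' and is skipped.  Returns
 * the number of wide characters stored; *nbad counts skipped bytes.
 */
size_t decode_lossy(const char *src, wchar_t *dst, size_t dstlen,
                    size_t *nbad)
{
    mbstate_t st;
    size_t left, out = 0;

    assert(src != NULL && dst != NULL && dstlen > 0 && nbad != NULL);

    left = strlen(src);
    memset(&st, 0, sizeof st);
    *nbad = 0;

    while (left > 0 && out + 1 < dstlen) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, src, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /*
             * Invalid (-1) or truncated (-2) sequence.  The state is
             * no longer usable, so reset it, emit a placeholder and
             * step over exactly one byte before trying again.
             */
            memset(&st, 0, sizeof st);
            wc = L'?';
            n = 1;
            (*nbad)++;
        } else if (n == 0) {
            break;              /* decoded a null character: stop */
        }
        dst[out++] = wc;
        src += n;
        left -= n;
    }
    dst[out] = L'\0';
    return out;
}
```

Stepping one byte at a time after a failure means the loop never has
to guess how long the bad sequence was: it simply resynchronizes at
the first byte from which mbrtowc() can decode again. The cost is
that a single foreign multibyte character may turn into several
placeholders.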