Re: wide characters and i18n
> I use wchar_t when I need to know that each character is represented
> by a fixed size object. [...] For example if you have a filename:
> To quickly extract the suffix '.txt' you just scan the string from
> right to left, until you hit '.' char. I think with utf-8 this type
> of string manipulation would be quite messy and you would have to use
> a special library that understands utf-8 encodings, etc.
In the case of scanning for a '.', it's totally trivial, because that's
in the ASCII range and thus (a) it's represented as a single 0x2e octet
and (b) a 0x2e octet does not occur under other circumstances. But in
the more general case, where you are or might be scanning for a
non-ASCII character, it's not that easy. Even then, though, it's not
bad; one of UTF-8's nice properties is that it is trivial to identify
whether a given octet is the first octet of a character or not
(continuation octets are exactly those of the form 10xxxxxx), thus
making it fairly easy to scan a string from right to left. With a
little extra work you can even accumulate the codepoint as you scan
backwards through the octets, so you don't have to scan backwards for
character beginnings and then forward to get the codepoints.
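To make both cases concrete, here is a rough sketch in C. The names
suffix(), utf8_is_continuation() and utf8_prev() are made up for this
illustration and are not from any library. It assumes the input is
well-formed UTF-8; overlong encodings and surrogate codepoints are not
rejected, so treat it as a sketch rather than a validating decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the ASCII case. A 0x2e octet can only ever be '.',
 * so a plain octet-wise search from the right is enough. Returns a pointer
 * to the last '.' (e.g. ".txt"), or NULL if there is none. */
static const char *suffix(const char *name)
{
    assert(name != NULL);
    return strrchr(name, '.');
}

/* Hypothetical helper: continuation octets have the form 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: decode the character that ends just before s[*pos],
 * accumulating the codepoint while walking backwards over the octets.
 * On success stores the codepoint in *cp, moves *pos to the character's
 * first octet, and returns 0. Returns -1 at the start of the string or on
 * a malformed sequence. */
static int utf8_prev(const unsigned char *s, size_t *pos, uint32_t *cp)
{
    uint32_t acc = 0;
    unsigned shift = 0;   /* bit position for the next 6-bit group */
    size_t i = *pos;

    assert(s != NULL && pos != NULL && cp != NULL);
    while (i > 0) {
        unsigned char b = s[--i];
        unsigned ncont;

        if (utf8_is_continuation(b)) {
            if (shift >= 18)
                return -1;            /* more than 3 continuation octets */
            acc |= (uint32_t)(b & 0x3F) << shift;
            shift += 6;
            continue;
        }

        /* b is a lead octet; check it matches the continuations seen. */
        ncont = shift / 6;
        if (b < 0x80) {
            if (ncont != 0) return -1;
            acc = b;
        } else if ((b & 0xE0) == 0xC0) {
            if (ncont != 1) return -1;
            acc |= (uint32_t)(b & 0x1F) << shift;
        } else if ((b & 0xF0) == 0xE0) {
            if (ncont != 2) return -1;
            acc |= (uint32_t)(b & 0x0F) << shift;
        } else if ((b & 0xF8) == 0xF0) {
            if (ncont != 3) return -1;
            acc |= (uint32_t)(b & 0x07) << shift;
        } else {
            return -1;                /* 0xF8..0xFF never appear in UTF-8 */
        }
        *cp = acc;
        *pos = i;
        return 0;
    }
    return -1;                        /* ran off the start of the string */
}
```

To walk a whole string right to left, start with pos equal to the
string's length in octets and call utf8_prev() until it returns -1;
each call yields one codepoint without ever rescanning forward.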
> There are two problems with C wide characters:
There are a lot more than two. :-)
> 1. Switching to different locales while the program is running is not
This is an implementation issue.
> and may result in weird errors. This means you can only use one
> locale during program run time.
Thread (un)safety doesn't mean you can't switch locales; for example,
if your program is not threaded, thread safety is totally irrelevant.
If you can't switch locales at run time, that's a separate bug.
> 2. [C library wide-character support is badly designed]
/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B