Re: wide characters and i18n

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: wide characters and i18n
From: der Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Fri, 16 Jul 2010 12:06:59 -0400 (EDT)

> I use wchar_t when I need to know that each character is represented
> by a fixed size object.  [...] For example if you have a filename:

> some_long_file_name.txt

> To quickly extract the suffix '.txt' you just scan the string from
> right to left, until you hit '.' char.  I think with utf-8 this type
> of string manipulation would be quite messy and you would have to use
> a special library that understands utf-8 encodings, etc.

In the case of scanning for a '.', it's totally trivial, because '.' is
in the ASCII range and thus (a) it's represented as a single 0x2e octet
and (b) a 0x2e octet never occurs as part of any other character's
encoding, since every octet of a multi-octet UTF-8 sequence has its
high bit set.  But in the more general case, where you are or might be
scanning for a non-ASCII character, it's not that easy.  Even then,
though, it's not bad; one of UTF-8's nice properties is that it is
trivial to tell whether a given octet is the first octet of a character
or a continuation octet (continuation octets, and only those, look like
10xxxxxx), which makes it fairly easy to scan a string from right to
left.  With a little extra work you can even accumulate the codepoint
as you scan backwards through the octets, so you don't have to scan
backwards for character beginnings and then forward to get the
codepoints.
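
A minimal sketch of both cases, in C.  The names is_continuation,
find_suffix and utf8_prev are hypothetical, made up for illustration;
they are not from any library.  The backward decoder checks only the
structure of each sequence (right number of continuation octets for the
lead octet); it does not reject overlong forms, surrogates, or values
above U+10FFFF, which a production decoder would probably want to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A UTF-8 continuation octet has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/*
 * Return a pointer to the last '.' in s, or NULL if there is none.
 * A plain octet-wise scan is enough: 0x2e cannot appear inside a
 * multi-octet UTF-8 sequence.  (strrchr(s, '.') does the same job.)
 */
static const char *find_suffix(const char *s)
{
    size_t i = strlen(s);

    while (i > 0) {
        if (s[--i] == '.')
            return s + i;
    }
    return NULL;
}

/*
 * Decode the character that ends just before offset *pos in buf,
 * accumulating the codepoint while scanning backwards.  On success,
 * store the codepoint in *cp, move *pos to the character's first
 * octet, and return 0.  Return -1 if *pos is 0 or the octets do not
 * form a structurally valid sequence; *pos and *cp are then unchanged.
 */
static int utf8_prev(const unsigned char *buf, size_t *pos, uint32_t *cp)
{
    size_t i;
    uint32_t acc = 0;
    int shift = 0;   /* bit position for the next payload bits */
    int ncont = 0;   /* continuation octets seen so far */

    assert(buf != NULL && pos != NULL && cp != NULL);
    i = *pos;
    while (i > 0) {
        unsigned char c = buf[--i];

        if (is_continuation(c)) {
            if (++ncont > 3)
                return -1;              /* no sequence is that long */
            acc |= (uint32_t)(c & 0x3F) << shift;
            shift += 6;
            continue;
        }
        /* c is a lead octet; it must match the continuation count. */
        if (c < 0x80 && ncont == 0)
            acc = c;
        else if ((c & 0xE0) == 0xC0 && ncont == 1)
            acc |= (uint32_t)(c & 0x1F) << shift;
        else if ((c & 0xF0) == 0xE0 && ncont == 2)
            acc |= (uint32_t)(c & 0x0F) << shift;
        else if ((c & 0xF8) == 0xF0 && ncont == 3)
            acc |= (uint32_t)(c & 0x07) << shift;
        else
            return -1;
        *cp = acc;
        *pos = i;
        return 0;
    }
    return -1;   /* ran off the start of the buffer */
}
```

To walk a string from right to left, start with *pos set to the string's
length in octets and call utf8_prev repeatedly until it returns -1.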

> There are two problems with C wide characters:

There are a lot more than two. :-)

> 1. Switching to different locales while the program is running is not
> thread-safe

This is an implementation issue.

> and may result in weird errors.  This means you can only use one
> locale during program run time.

Thread (un)safety doesn't mean you can't switch locales; for example,
if your program is not threaded, thread safety is totally irrelevant.
If you can't switch locales at run time, that's a separate bug.
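
For the non-threaded case, a rough sketch of switching locales at run
time with standard setlocale() might look like the following.  The names
with_ctype_locale and count_call are hypothetical, invented for this
example.  It assumes a single-threaded program, since setlocale()
changes process-wide state, and it says nothing about how any particular
libc behaves when threads are involved.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Call fn(arg) with LC_CTYPE temporarily set to the locale called
 * name, then restore the previous LC_CTYPE setting.  Returns 0 on
 * success, -1 if the locale could not be queried or set or if memory
 * ran out.  Not thread-safe: the locale is global to the process.
 */
static int with_ctype_locale(const char *name, void (*fn)(void *), void *arg)
{
    const char *cur;
    char *saved;

    assert(name != NULL && fn != NULL);

    /* Query the current setting; a later setlocale() call may
       overwrite the returned string, so keep a private copy. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL)
        return -1;
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, name) == NULL) {
        free(saved);
        return -1;              /* locale not available; nothing changed */
    }
    fn(arg);
    setlocale(LC_CTYPE, saved); /* switch back */
    free(saved);
    return 0;
}

/* Sample callback for illustration: counts how often it is called. */
static void count_call(void *arg)
{
    ++*(int *)arg;
}
```

A caller would pass whatever locale-dependent work it needs as fn, for
example a function that uses mbstowcs() on some input.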

> 2. [C library wide-character support is badly designed]

Agreed.  Entirely.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
