Re: wide characters and i18n

To: Sad Clouds <cryintothebluesky%googlemail.com@localhost>
Subject: Re: wide characters and i18n
From: Erik Fair <fair%netbsd.org@localhost>
Date: Wed, 14 Jul 2010 19:38:42 -0700

On Jul 11, 2010, at 05:40, Sad Clouds wrote:

> On Sun, 11 Jul 2010 07:19:12 -0400 (EDT)
> der Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:
> 
>>> If you want to do something like regular expression string matching,
>>> you would call mbsrtowcs() to convert multi-byte filename string to
>>> a fixed wide character string.
>> 
>> Maybe.  If you want to do regular expression matching against
>> _character_ strings, yes.  If _octet_ strings, no.
> 
> I'm not sure if simply comparing 8-bit integer units is going to work.
> Some encodings (e.g. JIS) may use escape sequences to indicate shifting
> to two byte encoding.
> 
> If the escape sequence to shift to Kanji is '<ESC>$B' and you're
> looking for ASCII '$' character, then part of the escape sequence will
> match.
> 
> It seems to defeat the whole point of doing character comparison,
> because you end up matching control data, which is not part of a
> logical character sequence that represents the string.
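
To make the quoted concern concrete, here is a minimal sketch in C. It
assumes ISO-2022-JP style shifts (ESC $ B or ESC $ @ into a two-byte set,
ESC ( B or ESC ( J back to a one-byte set) and ignores every other
designation; find_byte() and find_ascii_char() are hypothetical helpers
written for this illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Naive octet search: index of the first byte equal to c, or -1. */
long find_byte(const unsigned char *s, size_t n, unsigned char c)
{
    assert(s != NULL);
    for (size_t i = 0; i < n; i++)
        if (s[i] == c)
            return (long)i;
    return -1;
}

/*
 * Escape-aware search for the ASCII character c.  Shift sequences are
 * consumed as control data, and while the two-byte set is active every
 * character is skipped as a pair, so neither can produce a match.
 */
long find_ascii_char(const unsigned char *s, size_t n, unsigned char c)
{
    int two_byte = 0;   /* nonzero while a two-byte set is designated */
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        if (s[i] == 0x1b && i + 2 < n) {
            if (s[i + 1] == '$' && (s[i + 2] == 'B' || s[i + 2] == '@')) {
                two_byte = 1;
                i += 3;
                continue;
            }
            if (s[i + 1] == '(' && (s[i + 2] == 'B' || s[i + 2] == 'J')) {
                two_byte = 0;
                i += 3;
                continue;
            }
        }
        if (two_byte) {
            i += 2;     /* one two-byte character */
            continue;
        }
        if (s[i] == c)
            return (long)i;
        i++;
    }
    return -1;
}
```

For the ten bytes ESC $ B 0x24 0x22 ESC ( B a $, find_byte() reports a
'$' at offset 1 (inside the escape sequence), while find_ascii_char()
reports the real one at offset 9. The byte 0x24 at offset 3 is also not a
'$': it is the first half of a two-byte character.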

Two comments:

1. There's a fundamentally nasty problem that UNIX itself has never dealt with 
in a general way: what's in that file? Text? MPEG streams? JPEG?

The answer has typically been, "well, most of our software handles text 
(meaning 7-bit ASCII), and if you need something different, you have software 
to obtain or write ..." Which is to say, UNIX mostly avoided the question, 
other than by implication of the formats that the installed base of software 
was prepared to handle.

Extending that to ISO-8859-1 ("ISO Latin 1") was easy ("make everything 8-bit 
clean! No, that's not a parity bit any more!") and that handled what was then 
the western "free" world that traded in computers and software. Those of us 
involved in the IETF MIME effort did our best to think beyond that limited view 
of the world and try to make it possible (if a bit messy) for everyone to 
exchange information in character sets which express their native languages.

However, the IETF is explicitly (with one glaring, embarrassing exception) 
agnostic about software - they care about "bits on the wire" (protocols) not 
whatever your OS may be doing (e.g. APIs). It's similar to declaring a language 
for a technical conference - so long as you can express your thoughts in that 
language to the other attendees, who cares what your native language is? Keep 
your notes in whatever script you like.

Theoretically, the POSIX locale stuff is supposed to handle things beyond that, 
but it's a more complicated and subtle problem than those POSIX committees 
really thought about. Just setting the LANG environment variable (and its 
associates) to wherever you are or whatever you speak/read really tells the 
system and software exactly nothing about the content of the files you are 
manipulating - LANG speaks more of I/O to you, i.e. what you're prepared to 
read on a display, and what sorts of characters you'll be inputting from your 
... input devices.
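
A small sketch of that point, assuming a POSIX system that provides
nl_langinfo(3); locale_codeset() is a hypothetical helper name. All it can
report is the encoding the user's environment asks for, and nothing in it
ever looks at a file.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Name of the character encoding implied by LANG / LC_ALL / LC_CTYPE. */
const char *locale_codeset(void)
{
    const char *cs;

    /*
     * "" means "take the locale from the environment"; if that fails,
     * the process simply stays in the default "C" locale.
     */
    (void)setlocale(LC_CTYPE, "");
    cs = nl_langinfo(CODESET);
    assert(cs != NULL);
    return cs;
}
```

Whatever string comes back describes the terminal side of the
conversation. A file written yesterday under a different LANG, or fetched
from another machine, is not covered by it.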

I commend this well written paper to your attention:

http://plan9.bell-labs.com/sys/doc/utf.html

which discusses what the Plan 9 people (Rob Pike, Ken Thompson, et al.) did 
about the software problem, and explicitly what they decided to punt on. A 
precis: "we replaced the ASCII assumption with 
Unicode/UTF-8 because UTF-8 is a proper superset of ASCII (i.e. backward 
compatible) and also subsumes pretty much all other interesting character sets 
(with some warts) so we can translate into it without (much) semantic 
information loss."
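
The "superset of ASCII" property is easy to see in an encoder. The sketch
below assumes the modern code point ranges of RFC 3629 (up to U+10FFFF,
no surrogates), which is later than the scheme described in the Plan 9
paper; utf8_encode() is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* ASCII encodes as itself */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every byte of a multi-byte sequence has its high bit set, so an octet
search for an ASCII character such as '$' cannot match in the middle of
some other character.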

A little history of how UTF-8 actually came about is here:

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

2. With regard to file contents, there are three approaches: guessing, assuming 
a default, or explicit meta-data (magic bytes/cookies, filename extensions, or 
... a field in the inode, or whatever filesystem meta-data bundle you have).

Guessing has obvious disadvantages and probable violations of der Mouse's 
cherished "principle of least astonishment" (I cherish that principle, too). 
See file(1) for a rather heroic guessing program.
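
As a toy illustration of guessing (far cruder than what file(1) does, and
not taken from it), the sketch below sorts a buffer into "all 7-bit",
"well-formed UTF-8", or "something else". The enum and guess_text() are
hypothetical names; the validity rules assume RFC 3629, and any byte below
0x80, control characters included, is counted as ASCII.

```c
#include <assert.h>
#include <stddef.h>

enum text_guess { GUESS_ASCII, GUESS_UTF8, GUESS_OTHER };

enum text_guess guess_text(const unsigned char *p, size_t n)
{
    int saw_high = 0;   /* set once any byte >= 0x80 is seen */
    size_t i = 0;

    assert(p != NULL || n == 0);
    while (i < n) {
        unsigned char b = p[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80) {
            i++;
            continue;
        }
        saw_high = 1;
        if (b >= 0xC2 && b <= 0xDF) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return GUESS_OTHER; /* stray continuation or invalid lead */
        }
        if (n - i <= need)
            return GUESS_OTHER; /* sequence runs off the end */
        for (size_t k = 1; k <= need; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return GUESS_OTHER;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        /* reject overlong forms, surrogates, and values past U+10FFFF */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return GUESS_OTHER;
        i += need + 1;
    }
    return saw_high ? GUESS_UTF8 : GUESS_ASCII;
}
```

Note what the answer is worth: GUESS_OTHER cannot tell Latin-1 from
Shift-JIS from a JPEG, and a short Latin-1 file that happens to be
well-formed UTF-8 will be misreported. That is the astonishment risk in
miniature.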

Assuming a default ... well, that depends on the default, and, as you point 
out, what happens if the file contents and the default don't match? That could 
lead to an astonishing result, just like guessing wrong. Not good to sort(1) a 
Shift-JIS file if sort(1) is only expecting ASCII. We might do better if the 
assumption is UTF-8 and we both modify base system software to deal with that, 
and provide tools, e.g. iconv(1), to convert into and out of UTF-8.
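
The library side of that conversion is iconv(3). A minimal sketch follows,
assuming the POSIX prototype (char ** for the input pointer) and a single
call that consumes the whole buffer; convert_to_utf8() is a hypothetical
wrapper, and which encoding names are accepted depends on the installed
iconv implementation.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert in[0..inlen) from the encoding named `from` into UTF-8.
 * Returns the number of bytes stored in out, or -1 if the encoding is
 * unknown, the input is malformed, or out is too small.
 */
long convert_to_utf8(const char *from, const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() wants char **; input is only read */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;
    size_t rc;

    assert(from != NULL && in != NULL && out != NULL);
    cd = iconv_open("UTF-8", from);     /* note the order: to, then from */
    if (cd == (iconv_t)-1)
        return -1;
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;
    return (long)(outcap - outleft);
}
```

Converting the four ISO-8859-1 bytes "caf\xe9" this way should yield the
five UTF-8 bytes "caf\xc3\xa9". The caller still has to know (or guess, or
assume) the `from` encoding, which is the original problem again.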

Which leaves explicit meta-data. Apple went this route in HFS from day one of 
MacOS with Type and Creator right in the "Finder info" (their version of an 
inode) though it took them a very long time to deal with both the 
interchange issue and the notion that there could be file format standards 
that multiple programs can view/manipulate (e.g. RTF, PDF, MPEG, JPEG).

UNIX and Microsoft DOS (and its successors) have been using both in-file magic 
cookies and filename "extensions" though in UNIX filename extensions were 
always a convention rather than anything required by the OS or the filesystem; 
period is just another valid filename character. Apple has been going in this 
direction with MacOS X, with the stated intent to abandon the explicit meta-data 
they already have in their filesystem; given their UI, I think that's a 
mistake. UNIX seems to work pretty reasonably with its mishmash of conventions 
... but then I don't use X11 (too many years of using the much more thoroughly 
integrated MacOS environment have spoiled me - every time I try to use X, I have 
to work hard to suppress the strong desire to do violence to the people 
responsible for it).
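
For the in-file magic cookie convention, a minimal sketch of the check
looks like this. It knows only three leading signatures (JPEG, PDF, RTF),
chosen from the formats named above; sniff_magic() is a hypothetical
helper, and a real table such as the one behind file(1) is far larger.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a short format name if the buffer starts with a known magic
 * cookie, or NULL if nothing in this tiny table matches.
 */
const char *sniff_magic(const unsigned char *p, size_t n)
{
    assert(p != NULL || n == 0);
    if (n >= 3 && p[0] == 0xFF && p[1] == 0xD8 && p[2] == 0xFF)
        return "JPEG";
    if (n >= 5 && memcmp(p, "%PDF-", 5) == 0)
        return "PDF";
    if (n >= 5 && memcmp(p, "{\\rtf", 5) == 0)
        return "RTF";
    return NULL;    /* unknown: plain text of some encoding, or anything */
}
```

A NULL result is the interesting case for this thread: a cookie can say
"this is a PDF", but plain text carries no cookie to say which character
set it is in.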

Even explicit meta-data leaves us with a nasty "M by N" problem: M 
programs/libraries to modify for N different code sets ...

A whole lot of software has already been written to deal with this problem (but 
not necessarily completely or well), and you would do well to research what's 
available before attempting to reinvent your own rounder wheel - someone might 
have already solved your particular problem ... just not in the base NetBSD 
distribution.

        Erik <fair%netbsd.org@localhost>
