Re: wide characters and i18n
On Jul 11, 2010, at 05:40, Sad Clouds wrote:
> On Sun, 11 Jul 2010 07:19:12 -0400 (EDT)
> der Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:
>>> If you want to do something like regular expression string matching,
>>> you would call mbsrtowcs() to convert multi-byte filename string to
>>> a fixed wide character string.
>> Maybe. If you want to do regular expression matching against
>> _character_ strings, yes. If _octet_ strings, no.
> I'm not sure if simply comparing 8-bit integer units is going to work.
> Some encodings (e.g. JIS) may use escape sequences to indicate shifting
> to two byte encoding.
> If the escape sequence to shift to Kanji is '<ESC>$B' and you're
> looking for ASCII '$' character, then part of the escape sequence will
> It seems to defeat the whole point of doing character comparison,
> because you end up matching control data, which is not part of a
> logical character sequence that represents the string.
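The quoted concern is easy to reproduce. The sketch below is only an illustration: it assumes Python's built-in "iso2022_jp" codec as a stand-in for the JIS encoding being discussed, and the sample string is made up. It shows an octet-level search for '$' hitting the shift sequence even though no '$' character exists in the text.

```python
# Assumption: Python's "iso2022_jp" codec stands in for the JIS encoding
# mentioned in the quoted text; the sample string is hypothetical.
text = "漢字"                        # two Kanji, no '$' character anywhere
raw = text.encode("iso2022_jp")     # begins with the shift sequence ESC $ B

# Octet-level search: looks at raw bytes, including the escape sequences.
octet_hit = b"$" in raw

# Character-level search: decode first, then look for the character.
char_hit = "$" in raw.decode("iso2022_jp")

print(raw)
print("octet match:    ", octet_hit)
print("character match:", char_hit)
```

The octet search reports a match that the character search does not, which is exactly the false positive described above.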
1. There's a fundamentally nasty problem that UNIX itself has never dealt with
in a general way: what's in that file? Text? MPEG streams? JPEG?
The answer has typically been, "well, most of our software handles text
(meaning 7-bit ASCII), and if you need something different, you have software
to obtain or write ..." Which is to say, UNIX mostly avoided the question,
other than by implication of the formats that the installed base of software
was prepared to handle.
Extending that to ISO-8859-1 ("ISO Latin 1") was easy ("make everything 8-bit
clean! No, that's not a parity bit any more!") and that handled what was then
the western "free" world that traded in computers and software. Those of us
involved in the IETF MIME effort did our best to think beyond that limited view
of the world and try to make it possible (if a bit messy) for everyone to
exchange information in character sets which express their native languages.
However, the IETF is explicitly (with one glaring, embarrassing exception)
agnostic about software - they care about "bits on the wire" (protocols) not
whatever your OS may be doing (e.g. APIs). It's similar to declaring a language
for a technical conference - so long as you can express your thoughts in that
language to the other attendees, who cares what your native language is? Keep
your notes in whatever script you like.
Theoretically, the POSIX locale stuff is supposed to handle things beyond that,
but it's a more complicated and subtle problem than those POSIX committees
really thought about. Just setting the LANG environment variable (and its
associates) to wherever you are or whatever you speak/read really tells the
system and software exactly nothing about the content of the files you are
manipulating - LANG speaks more of I/O to you, i.e. what you're prepared to
read on a display, and what sorts of characters you'll be inputting from your
... input devices.
I commend this well written paper to your attention:
which discusses what the Plan 9 people (Rob Pike, Ken Thompson, et al.) did
about the software problem, and explicitly what
they decided to punt on. A precis: "we replaced the ASCII assumption with
Unicode/UTF-8 because UTF-8 is a proper superset of ASCII (i.e. backward
compatible) and also subsumes pretty much all other interesting character sets
(with some warts) so we can translate into it without (much) semantic
A little history of how UTF-8 actually came about is here:
2. With regard to file contents, there are three approaches: guessing, assuming
a default, or explicit meta-data (magic bytes/cookies, filename extensions, or
... a field in the inode, or whatever filesystem meta-data bundle you have).
Guessing has obvious disadvantages and probable violations of der Mouse's
cherished "principle of least astonishment" (I cherish that principle, too).
See file(1) for a rather heroic guessing program.
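To make the guessing approach concrete, here is a minimal sketch of a content sniffer. It is not how file(1) is implemented; the function name guess_type, the handful of magic prefixes, and the order of the checks are all assumptions chosen for illustration.

```python
# Hypothetical, much-simplified content guesser. The magic table and the
# order of the checks are illustrative assumptions, not file(1)'s rules.
MAGIC = [
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
]

def guess_type(data: bytes) -> str:
    # 1. Explicit in-file magic cookies win if present.
    for prefix, name in MAGIC:
        if data.startswith(prefix):
            return name
    if not data:
        return "empty"
    # 2. A NUL byte is a common hint that this is not plain text.
    if b"\x00" in data:
        return "binary data"
    # 3. Pure 7-bit content is treated as ASCII text.
    if all(b < 0x80 for b in data):
        return "ASCII text"
    # 4. Otherwise see whether the bytes happen to be valid UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8 text"
    except UnicodeDecodeError:
        # Could be Latin-1, Shift-JIS, ...: the bytes alone do not say.
        return "unknown 8-bit data"

print(guess_type(b"%PDF-1.4\n"))
print(guess_type(b"hello\n"))
print(guess_type("héllo".encode("utf-8")))
print(guess_type("héllo".encode("latin-1")))
```

The last case is the interesting one: the guesser can tell the bytes are not UTF-8, but it has no way to know which 8-bit code set they are in.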
Assuming a default ... well, that depends on the default, and, as you point
out, what happens if the file contents and the default don't match? That could
lead to an astonishing result, just like guessing wrong. Not good to sort(1) a
shift-JIS file if sort(1) is only expecting ASCII. We might do better if the
assumption is UTF-8 and we both modify base system software to deal with that,
and provide tools e.g. iconv(1), to convert into and out of UTF-8.
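A small sketch of why a mismatched default bites, and why converting to UTF-8 helps. It assumes Python's built-in "shift_jis" codec, and the decode/encode pair merely stands in for what an iconv(1)-style conversion would do; the sample character is chosen because its second Shift-JIS byte happens to be 0x5C, the ASCII backslash.

```python
# Assumption: Python's "shift_jis" codec; decode + encode stands in for an
# iconv(1)-style conversion. The sample character is an illustration.
text = "表"

sjis = text.encode("shift_jis")                  # two bytes: 0x95 0x5C
utf8 = sjis.decode("shift_jis").encode("utf-8")  # convert to UTF-8

# An ASCII-only tool scanning octets sees a backslash in the Shift-JIS data.
print("backslash byte in Shift-JIS:", b"\\" in sjis)

# In UTF-8, every byte of a non-ASCII character has the high bit set, so
# an ASCII octet can never appear inside a multi-byte character.
print("backslash byte in UTF-8:    ", b"\\" in utf8)
print("all UTF-8 bytes >= 0x80:    ", all(b >= 0x80 for b in utf8))
```

This is the backward-compatibility property at work: software that only understands ASCII octets can still search, split, and compare UTF-8 data on ASCII delimiters without tripping over pieces of other characters.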
Which leaves explicit meta-data. Apple went this route in HFS from day one of
MacOS with Type and Creator right in the "Finder info" (their version of an
inode) though it took them a very long time to deal with both the
interchange issue and the notion that there could be file format standards
that multiple programs can view/manipulate (e.g. RTF, PDF, MPEG, JPEG).
UNIX and Microsoft DOS (and its successors) have been using both in-file magic
cookies and filename "extensions" though in UNIX filename extensions were
always a convention rather than anything required by the OS or the filesystem;
period is just another valid filename character. Apple has been going in this
direction with MacOS X with stated intent to abandon the explicit meta-data
they already have in their filesystem; given their UI, I think that's a
mistake. UNIX seems to work pretty reasonably with its mishmash of conventions
... but then I don't use X11 (too many years of using the much more thoroughly
integrated MacOS environment has spoiled me - every time I try to use X, I have
to work hard to suppress the strong desire to do violence to the people
responsible for it).
Even explicit meta-data leaves us with a nasty "M by N" problem: M
programs/libraries to modify for N different code sets ...
A whole lot of software has already been written to deal with this problem (but
not necessarily completely or well), and you would do well to research what's
available before attempting to reinvent your own rounder wheel - someone might
have already solved your particular problem ... just not in the base NetBSD