tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



> As an English person I only see files with 0xa3 bytes in them (pound
> sterling), but that is enough to cause serious grief with programs
> that try to treat the data as UTF-8.  Were I French, I'm sure I'd
> have bigger problems.

Indeed.  I live and work in Quebec, where French and English are both
common.  I have seen programs like sort and grep silently truncate
their input at the first occurrence of a non-ASCII octet, probably
because (in the problematic cases) it's not valid UTF-8.  Mercifully,
I've had this strike only on Linux (so far); it's gotten so bad that
I've taken to copying data to NetBSD for processing.
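
To illustrate why a lone 0xa3 octet trips up UTF-8-aware tools, here is
a minimal sketch of a well-formedness check in C.  The function name
is_valid_utf8 is hypothetical, not taken from any libc, and nothing
here is meant to describe what sort or grep actually do internally.
In ISO 8859-1, 0xa3 is the pound sign; in UTF-8 it can appear only as
a continuation octet, so on its own it is malformed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
 * Rejects stray continuation octets, truncated sequences, overlong
 * encodings, surrogates, and code points above U+10FFFF.
 */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp;   /* code point being decoded */
        size_t n;           /* continuation octets expected */

        if (c < 0x80) {                 /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07;
        } else {
            return false;   /* 0x80..0xBF (e.g. a lone 0xa3) or 0xF8 and up */
        }

        if (len - i - 1 < n)
            return false;   /* sequence truncated by end of input */

        for (size_t k = 1; k <= n; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;           /* not a continuation octet */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Overlong forms: the value would have fit in a shorter sequence. */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000))
            return false;
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;               /* UTF-16 surrogate range */
        if (cp > 0x10FFFF)
            return false;               /* beyond the Unicode range */

        i += n + 1;
    }
    return true;
}
```

With this sketch, is_valid_utf8((const unsigned char *)"\xa3", 1)
returns false, while the two-octet UTF-8 encoding of the pound sign,
"\xc2\xa3", is accepted.  A tool that runs such a check has to decide
what to do on failure; stopping silently is the worst of the options.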

> One issue with filenames is that as soon as you allow multiple byte
> sequences (eg supplied to open) to match the same on-disk 'name' you
> can no longer do any form of fast or cached lookup - unless you are
> also (or only) saving the canonical form.  In which case you can do
> the canonicalisation in userspace (and not in the normal libc
> function).

To a point; you still have to do something with non-canonical octet
sequences on the disk.  But, more to the point, as soon as you're doing
the same job (canonicalization) in two places (kernel and userland)
there is the risk that they will work differently somehow, leading to
subtle bugs.  This is especially dangerous with something like Unicode
that's a moving target; fortunately most of the things affecting
canonicalization don't move much, but it doesn't take much.
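
As a small illustration of the "multiple byte sequences, same name"
problem, the sketch below holds two UTF-8 spellings of the same
visible name: one using precomposed U+00E9, the other using 'e'
followed by combining U+0301.  The identifiers name_nfc, name_nfd and
same_octets are made up for this example.  A plain octet comparison,
which is what a fast or cached lookup wants to use, treats them as
different names; only a canonicalization step (not shown, and not
part of standard C) would make them match.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "café" with precomposed U+00E9 (UTF-8: c3 a9); 5 octets. */
static const char name_nfc[] = "caf\xc3\xa9";

/* "café" as 'e' + combining acute U+0301 (UTF-8: cc 81); 6 octets. */
static const char name_nfd[] = "cafe\xcc\x81";

/* Hypothetical helper: octet-for-octet comparison of two names. */
bool same_octets(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Here same_octets(name_nfc, name_nfd) returns false even though both
strings render identically.  If the kernel and userland each carry
their own canonicalizer to bridge that gap, any disagreement between
the two is exactly the kind of subtle bug described above.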

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

