tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface

On Mon, Apr 01, 2013 at 08:58:41PM -0400, Mouse wrote:
> >> I do not want the filesystem interface to mangle the provided byte
> >> (historically meaning "octet") sequence
> > I believe that "octet" interpretation is a retroreification of an
> > historical assumption.  When the filesystem was being invented, a
> > filename was far from an austere unencoded bytestring.  The simple
> > fact is that the encoding was assumed to be ASCII.
> I find that doubtful...
> > The technology of the 70s also made safe the assumption that the
> > effective domain of filename characters was the ASCII printable set,
> > [33,127].

That is very doubtful. The use of octet values 128 through 255 for
printable characters probably dates to the 1970s and was certainly
common in the early 1980s.
Certainly before the 8859-n character sets were standardised.

The big fuckup with UTF-8 is that you can't take an arbitrary 'char'
sequence, convert it to 'wchar_t' and then convert it back to the same
'char' sequence. This effectively means that you can only treat strings
that are explicitly known to be UTF-8 as such - which generally means that
a program has to be told, for every string, whether it might be valid
UTF-8 before processing it.

My suspicion is that this problem only really affects Western Europeans.
The USA won't have any byte values over 127 in strings, and the Asians
couldn't use anything like ASCII.
As an English person I only see files with 0xa3 bytes in them (pound
sterling), but that is enough to cause serious grief with programs
that try to treat the data as UTF-8.
Were I French, I'm sure I'd have bigger problems.
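
To make the 0xa3 case concrete, here is a minimal sketch of a
well-formedness check for UTF-8. The function name is_valid_utf8 is
made up for illustration and is not part of any proposed interface.
A lone 0xa3 byte (pound sterling in ISO 8859-1) is a continuation byte
with no lead byte, so the check rejects it, whereas the UTF-8 encoding
of the same character (0xc2 0xa3) passes.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the n bytes at s are well-formed
 * UTF-8, otherwise 0.  Rejects stray continuation bytes, truncated
 * sequences, overlong forms, surrogates and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (c < 0x80) {         /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xe0) == 0xc0) {
            need = 1; cp = c & 0x1f; min = 0x80;
        } else if ((c & 0xf0) == 0xe0) {
            need = 2; cp = c & 0x0f; min = 0x800;
        } else if ((c & 0xf8) == 0xf0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;           /* e.g. a lone 0xa3 lands here */
        }

        if (n - i - 1 < need)
            return 0;           /* sequence runs off the end */

        for (size_t k = 1; k <= need; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xc0) != 0x80)
                return 0;       /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3f);
        }

        if (cp < min || cp > 0x10ffff)
            return 0;           /* overlong or out of range */
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;           /* surrogate */

        i += need + 1;
    }
    return 1;
}
```

A program that runs such a check first can fall back to treating the
data as opaque bytes when it fails, rather than mangling it.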

One issue with filenames is that as soon as you allow multiple byte
sequences (eg supplied to open) to match the same on-disk 'name' you
can no longer do any form of fast or cached lookup - unless you are
also (or only) saving the canonical form. In which case you can do
the canonicalisation in userspace (and not in the normal libc function).
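
A rough sketch of that split, under loud assumptions: canonicalise()
and lookup_exact() are hypothetical names, and ASCII case folding
stands in for whatever real canonicalisation (Unicode normalisation,
say) would be wanted. The point is only that the lookup itself stays a
byte-exact comparison, so it can still be hashed or cached, while the
many-to-one mapping happens once in the caller.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in canonicalisation (an assumption for illustration): fold
 * ASCII 'A'..'Z' to lower case.  Output is truncated to fit outsz
 * bytes and is always NUL-terminated when outsz > 0. */
void canonicalise(const char *in, char *out, size_t outsz)
{
    size_t i = 0;

    if (outsz == 0)
        return;
    for (; in[i] != '\0' && i + 1 < outsz; i++) {
        char c = in[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    out[i] = '\0';
}

/* Byte-exact lookup over stored canonical names; returns the index of
 * the match or -1.  A linear scan keeps the sketch short; a hash
 * table would work just as well because the comparison is exact. */
int lookup_exact(const char *const *names, size_t count, const char *key)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(names[i], key) == 0)
            return (int)i;
    return -1;
}
```

A caller would run canonicalise() on the name it was given and pass
the result to lookup_exact(); a name that skips the first step simply
fails to match.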


David Laight:
