tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Mon, 1 Apr 2013 20:58:41 -0400 (EDT)
Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:

> >> I do not want the filesystem interface to mangle the provided byte
> >> (historically meaning "octet") sequence
> > I believe that "octet" interpretation is a retroreification of an
> > historical assumption.  When the filesystem was being invented, a
> > filename was far from an austere unencoded bytestring.  The simple
> > fact is that the encoding was assumed to be ASCII.
> 
> I find that doubtful...

The implicit assumption of ASCII encoding permeates the description of
Unix from the very beginning.  The decision to adopt ASCII was
conscious.  Surely it's no coincidence that terminals and printers were
ASCII, and that filenames are lifted from the directory and splatted
out to the terminal uninterpreted.  

        "namei" is called (5770) with a second parameter of zero to
locate the named file.  ("u.u_arg[0]" contains the address in the user
space of a character string which defines a file path name.)
        -- Lions Commentary on UNIX 6th Edition, page 18-3

A "character string", he says.  He doesn't mention its encoding because
he doesn't have to.  He doesn't call it an octet sequence because
that's not what he means to say.  

I suppose the emphasis on the "octet" interpretation probably dates to
the rise of 8-bit encodings later standardized as ISO 8859-x.  

> > The technology of the 70s also made safe the assumption that the
> > effective domain of filename characters was the ASCII printable set,
> > [33,126].
> 
> ...and this too, because Unix has supported non-ASCII-printables in
> filenames pretty much from day one (certainly as of BSD 4.1c, which is
> as far back as I personally go, but from what I've heard, this
> property goes clear back to, what, V5?).  

Supported, yes, encouraged, no.  All I'm suggesting is that using Unix
from the command line on a VT-100 rewards naming files with
alphanumeric characters.  Even spaces are a nuisance, as you know; it
wasn't until the advent of point-and-click UIs that embedded spaces in
filenames became convenient.  

> > OTOH, every string is encoded.
> 
> I'm not convinced...unless you mean that every character string
> represented in a computer is encoded, which is true but not very
> relevant.  

That is what I mean.  Once you accept the idea that filenames are
strings and strings are encoded, you no longer need to defend the
airless proposition that they're uninterpretable octet sequences.  

> But then, what's passed to the syscalls, what's
> recorded on disk, isn't the filename; it's the encoding of the
> filename, with the information about which encoding was used lost.

Lost?  Yes, in the sense that information not saved is lost, and the
encoding is not saved.  OTOH, there's no need to record an invariant:
if every filename is ASCII-encoded, why record the encoding?  But it's
still encoded, regardless.  

> > Unicode combining characters, by contrast, allow distinct byte
> > sequences to represent identical character sequences.
> 
> This is a good reason to avoid using Unicode for encoding filenames
> (that is to say, for converting abstract filenames to encoded strings
> for use with syscalls).

I don't see any remotely practical alternative.  Unicode has been
adopted everywhere: by Microsoft, Apple, and most flavors of Linux.
UTF-8 was invented specifically to fit well with Unix's null-terminated
string convention.  It *is* a good fit.  It's just not a perfect,
drop-in replacement for ASCII, because byte-wise comparison is
insufficient to establish equality.  
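To make that concrete, here's a small Python sketch (the variable names are mine) showing two distinct UTF-8 byte sequences that spell the same character string, which is exactly why byte-wise comparison falls short:

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point, LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# As UTF-8 byte sequences the two names differ...
assert precomposed.encode("utf-8") != combining.encode("utf-8")
# b'\xc3\xa9' vs. b'e\xcc\x81'

# ...but after NFC normalization they are the same string.
assert (unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", combining))
```

Stored as raw octets, those are two different filenames; read as character strings, they're one.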

> > I believe the purpose of a filename is to name a file, i.e., to
> > associate a unique name with it.

Let me clarify that.  I didn't mean that an inode can't be linked to
more than once.  By "unique", I meant that no two names in the same
directory should represent the same character string, regardless of
their byte sequence.  

The ineluctable conclusion is that the software that enforces
uniqueness must be able to interpret the encoding.  ISTM that means we
need Unicode functionality in the kernel.  I don't see another choice,
except to punt.  
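What that enforcement would have to look like can be sketched in a few lines of Python; the function name and the fall-back policy for undecodable names are my inventions, not a proposal for an actual kernel interface:

```python
import unicodedata

def same_filename(a: bytes, b: bytes) -> bool:
    """Hypothetical uniqueness test: treat names as UTF-8 strings and
    compare their NFC-normalized forms, so that distinct byte sequences
    representing the same character sequence collide."""
    try:
        return (unicodedata.normalize("NFC", a.decode("utf-8"))
                == unicodedata.normalize("NFC", b.decode("utf-8")))
    except UnicodeDecodeError:
        # Names that aren't valid UTF-8 fall back to octet comparison.
        return a == b
```

The point is only that some such normalization step has to happen wherever uniqueness is enforced; whether that's in the kernel or punted to userland is the open question.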

--jkl

