Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: Mouse <mouse%Rodents-Montreal.ORG@localhost>
Date: Mon, 1 Apr 2013 20:58:41 -0400 (EDT)

>> I do not want the filesystem interface to mangle the provided byte
>> (historically meaning "octet") sequence
> I believe that "octet" interpretation is a retroreification of an
> historical assumption.  When the filesystem was being invented, a
> filename was far from an austere unencoded bytestring.  The simple
> fact is that the encoding was assumed to be ASCII.

I find that doubtful...

> The technology of the 70s also made safe the assumption that the
> effective domain of filename characters was the ASCII printable set,
> [33,127].

...and this too, because Unix has supported non-ASCII-printables in
filenames pretty much from day one (certainly as of BSD 4.1c, which is
as far back as I personally go, but from what I've heard, this property
goes clear back to, what, V5?).  I suspect it's supported high-half
octets as well, though I'm less sure of that.

> Perhaps you're right that the filesystem "has no business knowing"
> the encoding.  OTOH, every string is encoded.

I'm not convinced...unless you mean that every character string
represented in a computer is encoded, which is true but not very
relevant.  It is _not_ the case that every octet string is an encoded
character string (though, with many encodings, every octet string can
be interpreted as an encoded character string; one of the problems with
UTF-8 in some contexts is that this is not true of it).
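
To make the UTF-8 point concrete, here is a minimal sketch of a
well-formedness check.  The function name is_valid_utf8 is invented for
this illustration and is not taken from any existing library; the rules
it applies (no overlong forms, no surrogates, nothing above U+10FFFF)
are the usual ones for strict UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the n octets at s are well-formed
 * UTF-8, 0 otherwise.  Rejects overlong forms, surrogates, and code
 * points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char c = s[i];
        size_t need;            /* continuation octets expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (c < 0x80) {         /* plain ASCII octet */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;           /* stray continuation or invalid lead */
        }
        if (n - i - 1 < need)
            return 0;           /* sequence truncated */
        for (size_t k = 1; k <= need; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80)
                return 0;       /* not a continuation octet */
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;           /* overlong, out of range, or surrogate */
        i += need + 1;
    }
    return 1;
}
```

Called on the single octet 0xE9 (which is e-acute in ISO 8859-1), this
returns 0, while the two octets 0xC3 0xA9 are accepted.  A filename
containing a lone 0xE9 is a perfectly good octet string that simply is
not the UTF-8 encoding of anything.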

> Therefore every filename is encoded; therefore encoding is a property
> of a filename, regardless of whether or not the encoding is
> explicitly declared or recorded.

Hmm, that's arguable.  But, even aside from the premise you cited but I
disagree with, it's being rather slippery about just what a "filename"
is.  If a filename is the abstract character string (granting,
arguendo, your assumption that every octet string is an encoded
character string), then yes, every filename is encoded by the time it
makes it into code.  But then, what's passed to the syscalls, what's
recorded on disk, isn't the filename; it's the encoding of the
filename, with the information about which encoding was used lost.

But if a filename is the thing passed to the syscall or recorded on
disk, then it's an octet string, not a character string (though it may
be - usually will be, in most cases - obtained by encoding a character
string).

> For that matter, consider three files named IlI, lll and lII, which
> in some fonts render identically and in lots more almost so.
> Filenames that are "Confusing" because of glyph similarities are
> least visibly distinct, even if only by examining the pixels.

Should that be "...are at least..."?  Then I disagree; some fonts use
bit-identical glyphs for, eg, l and I.  If not, I'm not sure what you
mean.

> Unicode combining characters, by contrast, allow distinct byte
> sequences to represent identical character sequences.

This is a good reason to avoid using Unicode for encoding filenames
(that is to say, for converting abstract filenames to encoded strings
for use with syscalls).
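
A small sketch of the combining-character problem, assuming UTF-8 as
the encoding and a filesystem that compares names octet by octet.  The
identifiers name_precomposed, name_decomposed and same_octets are made
up for this example.

```c
#include <assert.h>
#include <string.h>

/* "cafe" with an acute accent on the e, written two ways in UTF-8. */

/* Precomposed: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). */
static const char name_precomposed[] = "caf\xC3\xA9";

/* Decomposed: U+0065 (e) followed by U+0301 (COMBINING ACUTE ACCENT). */
static const char name_decomposed[] = "cafe\xCC\x81";

/* Hypothetical helper: compare two names the way an octet-string
 * filesystem would, with no knowledge of any encoding. */
int same_octets(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

The two arrays are meant to display as the same character string, yet
they differ in length (5 versus 6 octets) and same_octets() reports
them as different, so both could exist side by side in one directory.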

> Maybe you believe the purpose of a filename is to associate an inode
> with blocks on the disk?

No.  That's the function of di_db[] and di_ib[], or their analogs in
the filesystem in question.

The purpose of a filename, as I see it, is to name a file, to provide a
short handle to refer to it by.  Exactly what this means depends on
what you mean by "filename", as I sketched above.

> I believe the purpose of a filename is to name a file i.e., to
> associate a unique name with it.

Then we need to throw out UFS, because it doesn't associate names with
files, but rather with links to files.  "touch x; ln x y" leaves you
with one file with two _different_ names (to the extent that it makes
sense to speak of a file having a name at all, that is).
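
The same experiment in C, as a rough sketch using only POSIX open(),
link() and stat().  The helper names touch_and_link and same_file are
invented here, and error handling is kept to the bare minimum.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: the equivalent of "touch x; ln x y".
 * Returns the file's link count afterwards, or -1 on error. */
long touch_and_link(const char *x, const char *y)
{
    struct stat st;
    int fd;

    assert(x != NULL && y != NULL);
    fd = open(x, O_WRONLY | O_CREAT, 0644);   /* touch x */
    if (fd < 0)
        return -1;
    close(fd);
    if (link(x, y) != 0)                      /* ln x y */
        return -1;
    if (stat(x, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/* Hypothetical helper: return 1 if both paths lead to the same file
 * (same device and inode), 0 if not, -1 on error. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Starting from a directory where neither name exists, touch_and_link()
reports a link count of 2 and same_file() says the two paths are one
file: two directory entries, one inode, and no sense in which either
name is "the" name.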

> The human interface is the raison d'être of filenames; every other
> use is incidental and subordinate.

I see this as a confusion of an abstract thing with its implementation.

It certainly would be possible to create a filesystem in which
filenames _are_ character strings, octet strings with associated
encoding tags.  It wouldn't surprise me if it had been done.  Indeed,
for sufficiently coarse granularity of encoding tagging, I've heard of
it being done.

But the conceptual naming of files with names (of which character
strings are just an implementation, actually) should not be confused
with the implementation of that concept as naming files-as-implemented
with encoded character strings.  Some of those implementations actually
go further, naming files with octet strings, which can be used to
implement encoded character strings but can also be used in other ways.
That's the Unix choice, and it's proven to be a good one, because of
the flexibility it gives.  Consider a filename (in the implementation
sense) like qfr31Nf5CD001566, which straddles the line: its main
purpose is as a software-to-software interface, to store a smallish
blob of data being transmitted from software at one time to software at
a later time.  (It uses an octet string which makes sense as a
character string in a common encoding not because that makes it any
better at its primary function, but rather because it makes things
easier on humans when they need to poke at that stuff manually and it
doesn't make it much worse at its primary function.)

> 1.  Filenames are encoded strings.
> 2.  The encoding was determined at time of creation.
> 3.  Uniqueness is a function of character sequences.

That's a nice theory.  It has very little to do with Unix, though.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
