tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Sun, 31 Mar 2013 19:52:34 -0400 (EDT)
Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:

> I do not want the filesystem interface to mangle the
> provided byte (historically meaning "octet") sequence 

I believe that "octet" interpretation is a retroreification of an
historical assumption. When the filesystem was being invented,
a filename was far from an austere unencoded bytestring.  The simple
fact is that the encoding was assumed to be ASCII.  The technology of
the 70s also made safe the assumption that the effective domain of
filename characters was the ASCII printable set, [33,126].  Only later,
when locales and GUIs came into play, did a rule about (non)encoding
need to be invented.  

An implicit encoding is not no encoding.  

> The filesystem has no business knowing whether the filename is Unicode
> or not - or, more precisely, the name at the filesystem interface
> cannot be Unicode because it isn't a character string at all...except
> in human minds.  

If we ever glimpse the human mind as a physical thing, I bet we won't
catch it holding character strings.  ;-)  

Perhaps you're right that the filesystem "has no business knowing" the
encoding. OTOH, every string is encoded.  Therefore every filename is
encoded; therefore encoding is a property of a filename, regardless of
whether or not the encoding is explicitly declared or recorded.   

I suppose you could say an implicit encoding exists only "in human
minds" insofar as it doesn't exist anywhere else.  That, indeed, is
part of the problem.  

> > Would you be happy with three files in one directory named
> > "årets_fotos" ("the year's pictures"), simply because each happened
> > to be represented in the dirent with different codepoint sequences?
> 
> Happy with in what sense?  Do I think the filesystem should permit it?
> Absolutely.   

Hmm.  I suppose the filesystem on-disk format has to permit it, in the
sense that "the filesystem" is neither defined by NetBSD nor under its
control. File systems are created and used elsewhere, and only later
mounted by NetBSD, so we have to be liberal in what we accept.  

> Do I think it has potential to confuse users?  Of course. Are those
> inconsistent?  No; it is not the filesystem's job to keep users from
> being confused.  ... For that matter, consider three files named IlI,
> lll and lII, which in some fonts render identically and in lots more
> almost so. 

Filenames that are "confusing" because of glyph similarities are at least
visibly distinct, even if only by examining the pixels.  Unicode
combining characters, by contrast, allow distinct byte sequences to
represent identical character sequences. The user has no way to
distinguish them (without examining the bytes).  
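
A minimal sketch of that point, in Python with the standard
unicodedata module.  The variable names are illustrative only, and the
choice of NFC is an assumption made for the example:

```python
import unicodedata

# Two spellings of the same visible name.
precomposed = "\u00e5rets_fotos"   # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030arets_fotos"   # 'a' followed by U+030A COMBINING RING ABOVE

# The code point sequences differ, and so do the UTF-8 byte sequences.
differs_as_codepoints = precomposed != decomposed
differs_as_bytes = precomposed.encode("utf-8") != decomposed.encode("utf-8")

# After canonical normalization the two spellings compare equal.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))

print(differs_as_codepoints, differs_as_bytes, same_after_nfc)
```

Both strings render as the same name, yet a byte-wise comparison treats
them as two different names.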

Maybe you believe the purpose of a filename is to associate an inode
with blocks on the disk?  I believe the purpose of a filename is to
name a file, i.e., to associate a unique name with it.  The human
interface is the raison d'être of filenames; every other use is
incidental and subordinate.  Any property it possesses that detracts
from its use for that purpose is a flaw.  

I doubt the user breathes to whom "to name a file" means "to
associate a UTF-8-encoded byte sequence with the inode".  To name a
file is to name it uniquely, and "unique" must lie within the limits
of user-perceived distinction.  Unicode combining characters do not meet
that threshold.  

I'm sorry to have been so long-winded.  Let me sum up.  

1.  Filenames are encoded strings.  
2.  The encoding was determined at time of creation. 
3.  Uniqueness is a function of character sequences. 

In particular, please note that filename interpretation is *not*
subject to the current environment.  The meaning of the byte sequence
-- the characters therein represented -- does not change with locale.  
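
As a small illustration (Python; the two encodings are picked only for
the example), the stored bytes stay the same, but reading them under an
encoding other than the one used at creation yields different
characters.  That is why the reading should not follow the reader's
locale:

```python
# Bytes as they would be stored when the name was created as UTF-8.
raw = "\u00e5rets_fotos".encode("utf-8")

# Interpreted with the encoding the creator used.
as_utf8 = raw.decode("utf-8")

# Interpreted under a different assumption (ISO 8859-1).
as_latin1 = raw.decode("latin-1")

# Same bytes, different character sequences.
print(as_utf8 == as_latin1)
```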

If you start from the premise that filenames are encoded, the question
becomes: where should the encoding be interpreted?  The answer ISTM is
clear: where filename conflicts are detected and resolved, in the
directory-handling logic of the kernel.  Where better to canonicalize
the name?  How else to determine if it's unique?  
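
A rough sketch of the idea in Python, not a kernel interface.  The
Directory class and its methods are hypothetical, and NFC is an assumed
choice of canonical form, since nothing above names one.  Names are
canonicalized where entries are created and looked up, so conflicts
are detected on character sequences rather than on raw code points:

```python
import unicodedata


class Directory:
    """Toy in-memory directory (hypothetical, for illustration only)."""

    def __init__(self):
        self._entries = {}  # canonical name -> inode number

    @staticmethod
    def canonical(name):
        # Assumption: NFC is the canonical form used for comparison.
        return unicodedata.normalize("NFC", name)

    def create(self, name, inode):
        key = self.canonical(name)
        if key in self._entries:
            # Same characters as an existing entry, however they were spelled.
            raise FileExistsError(name)
        self._entries[key] = inode

    def lookup(self, name):
        return self._entries[self.canonical(name)]


d = Directory()
d.create("\u00e5rets_fotos", 1)          # precomposed spelling

try:
    d.create("a\u030arets_fotos", 2)     # decomposed spelling, same characters
    conflict = False
except FileExistsError:
    conflict = True

print(conflict, d.lookup("a\u030arets_fotos"))
```

Either spelling finds the same entry, and the second create is
rejected as a duplicate.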

--jkl

