Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%NetBSD.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: "James K. Lowden" <jklowden%schemamania.org@localhost>
Date: Sun, 31 Mar 2013 18:28:47 -0400

On Sun, 31 Mar 2013 13:03:24 -0400 (EDT)
Mouse <mouse%Rodents-Montreal.ORG@localhost> wrote:

> For example, in filenames, I would consider <combining-acute><e> to be
> different from <e-acute>, just as today I consider
> <underscore><backspace><a> in a filename different from
> <a><backspace><underscore>.  

Those situations are not analogous.  Unicode combining sequences are
common, and not under the user's control.  For reasons of security and
convenience, the filesystem must canonicalize Unicode filenames.
Functions such as open(2) and glob(3) need to match every equivalent
codepoint sequence for a single input string.  
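
As a rough illustration of the matching rule I have in mind, here is a
minimal userland sketch in Python.  It assumes NFC as the canonical
form (any one normalization form would do), and glob_equivalent is a
hypothetical helper of my own, not an existing API.  It only shows the
comparison; it says nothing about where in the system it should live.

```python
import fnmatch
import unicodedata

def nfc(s):
    # Canonical composition: equivalent sequences map to the same string.
    return unicodedata.normalize("NFC", s)

def glob_equivalent(pattern, names):
    """Return the names matching pattern, comparing NFC forms on both sides.

    Hypothetical helper for illustration only.
    """
    wanted = nfc(pattern)
    return [n for n in names if fnmatch.fnmatchcase(nfc(n), wanted)]

names = [
    "\u00e4r.txt",    # precomposed a-diaeresis
    "a\u0308r.txt",   # "a" followed by combining diaeresis
    "ar.txt",         # plain ASCII, a different name
]
matches = glob_equivalent("\u00e4r.*", names)
naive = [n for n in names if fnmatch.fnmatchcase(n, "\u00e4r.*")]
```

Here the normalized comparison finds both spellings of the name, while
the naive byte-for-byte comparison finds only the one that happens to
use the same codepoints as the pattern.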

Please allow me to explain.  

I think we can agree that a backspace in a filename is rare.  I for one
believe that the filesystem should prohibit the use of ASCII control
characters in filenames.  At least it should be a mount option.  
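
The check such a mount option would apply is simple.  A minimal
sketch, assuming "ASCII control characters" means the C0 range
U+0000-U+001F plus DEL (U+007F); has_ascii_control is a hypothetical
name of my own:

```python
def has_ascii_control(name):
    """True if name contains a C0 control character or DEL.

    Hypothetical policy check; the exact set a filesystem would reject
    is an assumption.
    """
    return any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in name)

rejected = has_ascii_control("_\x08a")            # underscore, backspace, a
accepted = not has_ascii_control("\u00e5rets_fotos")
```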

When it comes to Unicode, the situation changes.  As you know, ordinary
characters used in everyday writing may be encoded with combining
characters.  To take a trivial concrete example for the sake of
discussion, the Swedish words for "is" and "year" -- är and år,
respectively -- can each be represented by more than one Unicode
codepoint sequence: a precomposed letter (U+00E4, U+00E5) or a plain
"a" followed by a combining mark (U+0308, U+030A).  Likewise, the
codepoint for the angstrom sign, U+212B, looks just like the codepoint
for a capital å, U+00C5: Å.  
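
To make that concrete, a small Python sketch using the standard
unicodedata module.  The choice of NFC as the canonical form is my
assumption for the example; the point is only that the distinct
sequences collapse to one.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

# Three spellings of the capital letter.
CAPITAL_FORMS = [
    "\u00c5",    # LATIN CAPITAL LETTER A WITH RING ABOVE
    "A\u030a",   # "A" followed by COMBINING RING ABOVE
    "\u212b",    # ANGSTROM SIGN
]
# Two spellings of the Swedish word for "year".
YEAR_FORMS = ["\u00e5r", "a\u030ar"]

capital_distinct = len(set(CAPITAL_FORMS))          # as raw codepoint strings
capital_canonical = {nfc(s) for s in CAPITAL_FORMS}  # after normalization
year_canonical = {nfc(s) for s in YEAR_FORMS}
```

As raw strings the three capital forms are all different; after
normalization each set holds a single element.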

Would you be happy with three files in one directory named
"årets_fotos" ("the year's pictures"), simply because each happened to
be represented in the dirent with different codepoint sequences?  In
what sense would that be a Good Thing?  
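
A sketch of how such look-alike entries could be detected in a
directory listing, again assuming NFC as the comparison key.  The
listing is made up, find_collisions is a hypothetical helper, and the
name is capitalised here so that the angstrom sign supplies a third
spelling.

```python
import unicodedata
from collections import defaultdict

def find_collisions(names):
    """Group names by NFC form; return only groups with more than one member.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}

listing = [
    "\u00c5rets_fotos",    # precomposed capital
    "A\u030arets_fotos",   # "A" followed by combining ring above
    "\u212brets_fotos",    # angstrom sign
    "notes.txt",
]
collisions = find_collisions(listing)
```

All three entries land under the same key, which is exactly the
situation a user cannot see by looking at the directory.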

(I don't know what other OSs have done; it doesn't seem like much.  A
quick trip to Google and through the docs on my local Ubuntu system
shows that bash and glob(7) provide for equivalence classes, e.g.
"[=a=]" matches most things that look like "a".  They are silent on
the question of the various codepoint sequences for "å".)  

Certainly the Worse is Better school would push the problem out to
userland and absolve the filesystem.   ISTM filenames are there for the
user's sake, and filename uniqueness is judged at the semantic level of
linguistic perception.  Leaving him to fend for himself against
Unicode's unfortunate complexity is a disservice.  

--jkl

