Re: A draft for a multibyte and multi-codepoint C string interface

To: tech-userlevel%netbsd.org@localhost
Subject: Re: A draft for a multibyte and multi-codepoint C string interface
From: David Young <dyoung%pobox.com@localhost>
Date: Sun, 30 Jun 2013 14:13:23 -0500

On Mon, Apr 15, 2013 at 05:51:33PM -0400, James K. Lowden wrote:
> On Mon, 15 Apr 2013 11:05:50 +0200
> tlaronde%polynum.com@localhost wrote:
> 
> > If there are user level tools to filter the "ls" output to match the
> > variations (accented, not accented; capitalized, not capitalized;
> > ligatures, no ligatures), fine. But user level.
> 
> If I understand you correctly, the most important point in this
> discussion is that the kernel must make no interpretation of the
> filename.  
> 
> > what do you get for writing the ligature 'oe' in naming a resource
> > 'oeuvres' instead of the plain letters? 
> 
> What do you mean by "plain" letters?  ASCII?  Perhaps my example was
> poorly chosen, because the "oe" ligature is only a custom.  Too many
> languages cannot be represented, even crudely, with ASCII.  
> 
> What you get is the user's ability to name things in his native
> tongue.  
> 
> > What has this to do with a computer resource? 
> 
>       "There are only two problems in computer science.  Cache
> coherency and naming things."  
> 
> I'm interested in useability.  We already *permit* filenames to be
> encoded with UTF-8, but we don't *support* them.  We permit two
> filenames in one directory whose letter sequence is identical if the
> byte sequence differs.  The sort order is arbitrary: "coeur" and
> "cœur" don't sort next to each other, although they should.  The user
> has no way to know nor reason to care whether "året" uses four
> Unicode code points or five.  If he types "vi året", I think the file
> should open if the character strings match regardless of the
> byte-sequences, but today the odds are 1:4 against.  
> 
> Who considers that state of affairs good?  
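
To make the four-versus-five point concrete, here is a small Python
sketch.  It is not something from this thread, and same_text is a name
made up for illustration.  It uses the standard unicodedata module to
show two spellings of "året" that differ as byte sequences but compare
equal once both are normalized to NFC.

```python
import unicodedata

composed = "\u00e5ret"      # 'å' as one precomposed code point (U+00E5)
decomposed = "a\u030aret"   # 'a' followed by COMBINING RING ABOVE (U+030A)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Four code points versus five.
print(len(composed), len(decomposed))
# The UTF-8 byte sequences differ, so a byte-wise lookup would miss.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
# After normalization the two spellings compare equal.
print(same_text(composed, decomposed))
```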

It sounds to me like you may want to treat filesystem pathnames like
they are the multilingual, human-readable, high-fidelity *titles* for
files and directories.  You are not alone, but I question both how well
this has worked in practice over the years and whether it can ever
work well in the future.  Even if filesystems have not reached the
pinnacle of their development, it seems that there are other ways of
organizing and locating files whose development will pay off more.

Consider this real path on my Mac,

Music/iTunes/iTunes\ Music/Sinéad\ O\'Connor/The\ Lion\ and\ the\ Cobra/

Because of the not-so-special but special-nonetheless characters in
the path---the spaces, the accented e, and the apostrophe---to type
that path will be a royal pain.  To find(1) directories containing the
word 'Sinéad' will also be a pain if a) you don't know how to type é
or b) you didn't realize 'Sinéad' was spelled with an é. To write a
script that tolerates paths like that---I wrote one once to de-duplicate
files in the iTunes\ Music directory---requires special care to protect
against spaces being interpreted as field delimiters.
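
As a rough illustration of the kind of care involved, here is a minimal
Python sketch of a de-duplicator.  It is not the script mentioned
above; find_duplicates is a hypothetical name, and grouping files by a
SHA-256 of their contents is an assumption about what "duplicate"
means.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root: str) -> dict[str, list[str]]:
    """Group regular files under root by the SHA-256 of their contents."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # The whole path stays one string; it is never split on spaces.
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Hash in 1 MiB chunks so large tracks are not read at once.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    # Keep only digests shared by more than one file.
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Because each path is carried around as a single string and is never
re-parsed by a shell, the spaces, the é, and the apostrophe need no
quoting.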

(How do you even type an é with wscons?)

BTW, I did not create that pathname, iTunes did.  I guess that they
do that in a nod to transparency and to the UNIX underpinnings of
Mac OS X.  They didn't have to call it that because, for one thing,
I'm pretty sure that the tracks are described in some MP4 metadata
and indexed elsewhere than in the filesystem.  They could have
called the album folder 'sinead-o-connor/the-lion-and-the-cobra' or
'724433ae32c1d648/0ea91363fb9b3dfa'.
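
A name of the first kind is easy to derive mechanically.  The following
Python sketch is only an assumption about how one might do it (slugify
is a made-up helper, not anything iTunes provides): decompose the
title, drop whatever is not ASCII, and join the remaining words with
hyphens.

```python
import re
import unicodedata

def slugify(title: str) -> str:
    """Reduce a title to lowercase ASCII words joined by hyphens."""
    # NFKD splits 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", title)
    # Dropping non-ASCII code points removes the accent and keeps the 'e'.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

print(slugify("Sinéad O'Connor") + "/" + slugify("The Lion and the Cobra"))
# sinead-o-connor/the-lion-and-the-cobra
```

The mapping is lossy: titles in scripts with no ASCII decomposition
reduce to an empty string, so it suits a name that is merely a handle,
not a title.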

I'm just glad that iTunes did not prevail upon me, the user, to name
that folder.  Users aren't much good at assigning filenames or at
remembering them later.  Besides, there are much better ways to
classify, index, and locate files than to use a hierarchical
filesystem.  To find music files, typically I search by keyword for an
album/performer/song using either the Mac's full-text index, Spotlight,
or the iTunes search box.  They are both smart enough to match 'Sinéad'
with the keyword 'sinead'.  That is a much more direct way to find what
I am looking for than to try to remember and use iTunes' or my own
filing convention.
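
How Spotlight or iTunes match 'sinead' to 'Sinéad' internally is not
something known here, but the general idea can be sketched in Python
under simple assumptions: fold accents and case away, then index each
word.  The names fold, build_index, and search are hypothetical, and
the index is keyed by whatever identifier the caller chooses.

```python
import re
import unicodedata
from collections import defaultdict

def fold(text: str) -> str:
    """Strip combining marks and case, so 'Sinéad' becomes 'sinead'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def build_index(titles: dict[str, str]) -> dict[str, set[str]]:
    """Map each folded word to the identifiers whose title contains it."""
    index = defaultdict(set)
    for ident, title in titles.items():
        for word in re.findall(r"\w+", fold(title)):
            index[word].add(ident)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return identifiers whose titles contain every word of the query."""
    words = re.findall(r"\w+", fold(query))
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

With titles such as {"t1": "Sinéad O'Connor", "t2": "The Pogues"}, the
queries "sinead" and "Sinéad" both return {"t1"}, whichever way the
user happens to type the name.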

The words of a song are another useful access point.  While I was
writing, an image from a Sinéad O'Connor music video came to mind.
What was the name of the song?  I could not remember.  The lyrics
"it's been seven hours and fifteen days" came to mind, though, and a
Google search on those lyrics quickly revealed 'Nothing Compares 2 U'.
Apparently, iTunes is not sophisticated enough to fetch the lyrics of a
song from the web and use them to index your songs.

Music is not the only content on the computer that I can access more
directly and rapidly by means other than filesystem paths.  Typically
I will access a C or assembly file in the NetBSD kernel using ctags:
vi -t ixgbe_ioctl.  Ctags, as you know, is not very sophisticated or
reliable.  It still gets me where I'm going.  I also have direct access
to programmer's manuals through Spotlight.  'Control-Space 82599'
turns up all of the PDFs concerned with the Intel 82599, including a
specification update that was unhelpfully named 322421.pdf by Intel.

> I'm confident that glob(3) could be adapted to Unicode, that open(2)
> could canonicalize, that ffs could be changed to reflect the encoding,
> and mount(2) to enforce it.  That's just a small matter of
> programming.  For it to happen, though, we need consensus that it's
> good and necessary.  A consensus that seems surprisingly hard to
> establish.  

Maybe it's good.  I don't know if it's necessary.  Developing a rapid
full-text search capability will probably have a greater and faster
pay-off than trying to make UTF-8 filenames a coherent part of UNIX.

Dave

-- 
David Young
dyoung%pobox.com@localhost    Urbana, IL    (217) 721-9981

