tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Sun, 31 Mar 2013 18:41:06 -0400
Ken Hornstein <kenh%cmf.nrl.navy.mil@localhost> wrote:

> >(I don't know what other OSs have done; it doesn't seem like much.  A
> >quick trip to Google and through the docs on my local Ubuntu system
> >show that bash and glob(7) provide for "equivalent" sequences, e.g.
> >"[=a=]" matches most things that look like "a".  They are silent on
> >the question of various codepoint sequences for "__".)  
> 
> On MacOS X all filenames are UTF-8, NFD (so they're all decomposed).
> Composed codepoints in a filename are decomposed into their base
> character and combining character.
> 
> I believe under Solaris if you mount with a special Unicode option you
> can use either composed or decomposed and the original byte sequence
> is used as the filename, but you can't create two files that have the
> same normalization 

In the systems you're describing, the OS prohibits certain kinds of
duplication, defined as two names sharing the same normalized byte
sequence.  How do they deal with duplicates when mounting filesystems
that permit them?  
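
To make "sharing the same normalized byte sequence" concrete, here is a
minimal Python sketch.  It assumes names are valid UTF-8 and that NFD is
the form being compared (as described for MacOS X above); the function
name find_normalization_collisions and the sample name "café" are mine,
purely for illustration, and say nothing about how any kernel does it.

```python
import unicodedata
from collections import defaultdict

def find_normalization_collisions(names, form="NFD"):
    """Group raw byte-string names by their normalized text.

    Hypothetical helper: returns only the groups in which more than
    one distinct byte sequence maps to the same normalized string.
    """
    groups = defaultdict(list)
    for raw in names:
        text = raw.decode("utf-8")                 # assumes valid UTF-8
        key = unicodedata.normalize(form, text)    # NFD by default
        groups[key].append(raw)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Two byte sequences for what a user would read as the same name.
composed = "caf\u00e9".encode("utf-8")      # e-acute as one codepoint
decomposed = "cafe\u0301".encode("utf-8")   # 'e' + combining acute

collisions = find_normalization_collisions([composed, decomposed, b"abc"])
```

The two sequences differ as bytes yet land in one group, which is the
kind of pair a normalizing system would have to refuse or disambiguate.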

I'm curious about how duplicate names are made distinct to the user.  I
assume one of them is mangled somehow, perhaps by adding "-1" to the name.  

If one name is mangled, I wonder what happens if open(2) is called with
the exact on-disk byte sequence of the mangled name -- as recorded on
disk, not as presented to the user.  Which file is opened, if any?
Does the system prefer an exact match to the original sequence,
or normalize it to match the other filename that wasn't mangled,
or return file not found?  

For clarity (I hope) imagine these two files

        UTF-seq Norm    Representation
        A       "abc"   "abc"
        B       "abc"   "abc-1"

Given two filenames with byte sequences A and B representing the same
character sequence per normalization rules, how are they represented so
as to let the user distinguish them, and how does the OS respond when
open is called with the pre-mangled sequence B?  
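
One way to poke at part of this from userland is sketched below in
Python.  It only probes the creation side (whether a directory accepts
both sequences), not the mount-time case I'm asking about, and the
outcome depends entirely on the filesystem under the temporary
directory.  The helper name probe_directory and the sample name are my
own assumptions.

```python
import os
import tempfile

def probe_directory(dirpath):
    """Try to create two names that differ only in normalization.

    Hypothetical probe: reports whether each create succeeded and what
    the directory listing looks like afterwards, as raw bytes.
    """
    composed = "caf\u00e9".encode("utf-8")      # sequence A
    decomposed = "cafe\u0301".encode("utf-8")   # sequence B
    base = os.fsencode(dirpath)
    results = {}
    for label, name in (("A", composed), ("B", decomposed)):
        path = os.path.join(base, name)
        try:
            # O_EXCL makes the create fail if the OS considers the
            # name already present, rather than silently reusing it.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
            os.write(fd, label.encode("ascii"))
            os.close(fd)
            results[label] = "created"
        except OSError as exc:
            results[label] = "refused: " + str(exc)
    # Raw byte names as the OS reports them back.
    results["listing"] = sorted(os.listdir(base))
    return results

with tempfile.TemporaryDirectory() as d:
    report = probe_directory(d)
```

Two entries in report["listing"] would mean the filesystem keeps both
byte sequences; a single entry would mean B was refused as a duplicate
of A.  Comparing the listed bytes with what was passed in would also
show whether the name was stored as given or normalized.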

--jkl

