tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Tue, 16 Apr 2013 01:04:47 +0200
tlaronde%polynum.com@localhost wrote:

> > We permit two filenames in one directory whose letter sequence is 
> > identical if the byte sequence differs.  (...)
> 
> The only thing that is mandatory, is that if I use one identifier
> for my resource, when I reuse the same identifier I have the
> resource. 

That is, we agree, what happens today.  It is also insufficient.  

You may save the file with string S using byte-sequence A; I may attempt
to open that file with string S using sequence B.  Why should that
operation fail?  What possible purpose is served?  
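
A minimal sketch of that failure case in Python. The name "café" is a
hypothetical example, not one from this thread, and the sketch assumes
sequences A and B are the NFC and NFD forms of the same string encoded
as UTF-8:

```python
import unicodedata

name = "caf\u00e9"  # hypothetical filename "café" (assumption, for illustration)

# Sequence A: precomposed U+00E9.  Sequence B: "e" followed by U+0301.
seq_a = unicodedata.normalize("NFC", name).encode("utf-8")
seq_b = unicodedata.normalize("NFD", name).encode("utf-8")

def same_string(a: bytes, b: bytes) -> bool:
    """Compare two UTF-8 names by canonical equivalence rather than by bytes."""
    return (unicodedata.normalize("NFC", a.decode("utf-8"))
            == unicodedata.normalize("NFC", b.decode("utf-8")))

print(seq_a)                      # b'caf\xc3\xa9'
print(seq_b)                      # b'cafe\xcc\x81'
print(seq_a == seq_b)             # False: a byte-wise lookup misses the file
print(same_string(seq_a, seq_b))  # True: both spell the same string S
```

A lookup that compares bytes treats A and B as different names; one
that compares normalized forms does not.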

You're of course free to say that our inability to foresee a purpose
doesn't prove there isn't one.  But you're then guaranteeing immediate
known problems for the sake of potential "flexibility" with no known
purpose.  

> > I'm interested in usability. 
> 
> To drop codepages, and to mandate that, on the user level, UTF-8 is
> the rule (bye-bye localization and so on) allows this. 

Allows, but does not, by itself, effect.  

> But this has been done for Plan9 and this should be considered a
> reference. Specifically, what has been done, and what has not been done.

Agreed, Plan 9 offers insight into what to do and what not to do.
I accept that introducing canonicalization or other Unicode features
wouldn't have served their research purposes.  That does not preclude a
general-purpose OS, operating outside a lab, from addressing itself to
questions they didn't answer.  

> Do you mean that a fileserver will have to serve the very same
> filesystem listing differently, rendering differently the names
> depending on the localization of the _client_? 

Hmm.  That's what happens now.  ls(1) copies the bytes in the filename
to standard output.  How they are rendered is a function of the client's
locale, specifically its encoding.  Hardly a recipe for correctness.  
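
To illustrate, a short Python sketch of the same stored bytes rendered
under two client encodings.  The filename and the choice of UTF-8 and
Latin-1 are assumptions made for the example:

```python
# Bytes as they might sit in a directory entry (assumed to be UTF-8 "café").
raw = "caf\u00e9".encode("utf-8")

# The bytes are identical; only the client's idea of the encoding differs.
as_utf8 = raw.decode("utf-8")      # client locale uses UTF-8
as_latin1 = raw.decode("latin-1")  # client locale uses ISO 8859-1

print(as_utf8)
print(as_latin1)
print(len(as_utf8), len(as_latin1))
```

The second client sees two unrelated characters where the first sees
one, and nothing in the stored name says which reading was intended.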

Bringing everything to UTF-8 reduces the scope of potential error.  A
general solution -- which I don't recommend -- would be to answer Yes
to your question, provided the filename's encoding is *known* (not
assumed, as now).  Then userland could convert between the filename's
encoding and the user's encoding.  
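
A sketch of what that userland conversion might look like, in Python.
The function `for_user` and its parameters are hypothetical; no such
interface exists, and the sketch assumes the filename's encoding is
recorded somewhere and handed to the caller:

```python
def for_user(raw: bytes, name_encoding: str, user_encoding: str) -> bytes:
    """Convert a filename from its known encoding to the user's encoding.

    Characters the user's encoding cannot represent become '?'.
    """
    return raw.decode(name_encoding).encode(user_encoding, errors="replace")

stored = "caf\u00e9".encode("latin-1")       # hypothetical name, stored as Latin-1
print(for_user(stored, "latin-1", "utf-8"))  # b'caf\xc3\xa9'
print(for_user(stored, "latin-1", "ascii"))  # b'caf?'
```

The second call shows the cost: the conversion is lossy whenever the
user's encoding is smaller than the filename's.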

> If someone wants to put a factotum that intercepts creation of
> filenames, or search for filenames implementing some policy, this is
> fine. But this has nothing to do in the kernel, neither something to
> do in the base by default

There's no precedent for alternative userland functionality.  Unless
you propose to have a version of libc and of ls(1) and associated
utilities in pkgsrc?  That would be a large engineering effort, and a
pointless one.  

Quite simply, we know a string may have more than one Unicode
encoding, and we know filenames are strings.  How can we expect
programs that manipulate those strings to do so correctly if they can't
identify the character boundaries?  
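
A rough Python sketch of the boundary problem.  The name is a
hypothetical example, and `clusters` is a simplification written for
this illustration: it only attaches combining marks to the preceding
code point, which is less than full Unicode grapheme segmentation:

```python
import unicodedata

def clusters(s: str) -> list:
    """Group each code point with the combining marks that follow it."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch   # combining mark: belongs to the previous character
        else:
            out.append(ch)  # starts a new character
    return out

name = "cafe\u0301"  # hypothetical name: "café" spelled with a combining acute

print(len(name.encode("utf-8")))  # 6 bytes
print(len(name))                  # 5 code points
print(len(clusters(name)))        # 4 characters as the user sees them
print(name[:4])                   # "cafe": the accent has been cut off
```

A program that counts or truncates by bytes or by code points gets a
different answer from one that counts what the user sees.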

If the programs that manipulate those strings don't deal with encoding
ambiguity, users will have to.  I'm saying they shouldn't have to, and
therefore anything that touches a filename must be Unicode-aware.  

As I said, we *allow* UTF-8 filenames today.  We just don't support
them.  We ignore the problems.  Or insist they're features.  

--jkl

