tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Sun, 7 Apr 2013 18:08:07 +0200
tlaronde%polynum.com@localhost wrote:

> All in all, as long as filesystems accept arbitrary 8-bit-clean
> pathnames, they are UTF-8 ready without knowing it and without having
> to know.

If filenames are to have a known encoding, something somewhere must
enforce it and flag errors (for extant names where another system failed
to enforce it).  That seems very basic and uncontroversial to me.
This discussion has helped me see that enforcement could be done in
userspace, and perhaps should be.   
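
Concretely, the userspace check could be as simple as refusing any
pathname component that is not well-formed UTF-8.  Here is a rough
sketch; the function name and the policy are mine, purely for
illustration, but a wrapper around open(2), or an archiver, could call
something like it and flag the error:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Sketch of a userspace check: accept a name only if it is
     * well-formed UTF-8 (no stray continuation bytes, no truncated or
     * overlong sequences, no surrogates, nothing past U+10FFFF).
     */
    static bool
    name_is_valid_utf8(const unsigned char *s, size_t len)
    {
            size_t i = 0;

            while (i < len) {
                    unsigned char c = s[i++];
                    unsigned long cp, min;
                    size_t need;

                    if (c < 0x80)
                            continue;        /* plain ASCII */
                    else if ((c & 0xE0) == 0xC0) {
                            need = 1; cp = c & 0x1F; min = 0x80;
                    } else if ((c & 0xF0) == 0xE0) {
                            need = 2; cp = c & 0x0F; min = 0x800;
                    } else if ((c & 0xF8) == 0xF0) {
                            need = 3; cp = c & 0x07; min = 0x10000;
                    } else
                            return false;    /* 0xFE, 0xFF or stray continuation */

                    if (len - i < need)
                            return false;    /* truncated sequence */
                    while (need-- > 0) {
                            unsigned char b = s[i++];
                            if ((b & 0xC0) != 0x80)
                                    return false;    /* bad continuation byte */
                            cp = (cp << 6) | (b & 0x3F);
                    }
                    if (cp < min ||                          /* overlong form */
                        (cp >= 0xD800 && cp <= 0xDFFF) ||    /* UTF-16 surrogate */
                        cp > 0x10FFFF)                       /* beyond Unicode */
                            return false;
            }
            return true;
    }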

> What I don't understand is (though I may have misunderstood the
> subject, since I'm not a native English speaker) the proposal to put
> encoding considerations into the filesystems or into the file-handling
> system calls.

Filenames already use an implicit encoding (at least, most of them
do) from the user's perspective.  If the encoding were recorded in the
filesystem, it would facilitate correct interpretation of the names.  
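
Purely as an illustration of "recorded in the filesystem" (this is not
an existing convention), the declared encoding could live out of band,
say as an extended attribute on a directory.  The sketch below uses the
Linux xattr calls; NetBSD would use the extattr(2) family instead, and
both the attribute name and the path are made up:

    #include <sys/types.h>
    #include <sys/xattr.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Illustration only: tag a directory with the encoding its entry
     * names are expected to use.  "user.name-encoding" is a made-up
     * attribute name, and /srv/share a made-up path.
     */
    int
    main(void)
    {
            const char *dir = "/srv/share";
            const char *enc = "UTF-8";
            char buf[64];
            ssize_t n;

            if (setxattr(dir, "user.name-encoding", enc, strlen(enc), 0) == -1) {
                    perror("setxattr");
                    return 1;
            }

            n = getxattr(dir, "user.name-encoding", buf, sizeof(buf) - 1);
            if (n == -1) {
                    perror("getxattr");
                    return 1;
            }
            buf[n] = '\0';
            printf("names under %s are declared to be %s\n", dir, buf);
            return 0;
    }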

> The locales are already a nightmare ...
> I hate case insensitive filesystems, 

These issues are beside the point.  I don't see how locale affects
encoding and decoding a character string. 

> If two distinct strings are indeed exactly the same thing, there
> should be only one way to write them. If there are subtleties, this
> means that they are different. One or the other.

Exactly so.  The same Unicode text can have more than one byte
representation, because of combining characters.  If the filesystem
stores one normalization form, something somewhere must convert
user-provided strings to that form before byte-wise matching will work.
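
A tiny demonstration (the filename is invented): "café" spelled with
the precomposed U+00E9 versus with "e" plus the combining acute U+0301
is the same text to the user but two different byte strings, so a
byte-wise lookup by one spelling misses a name stored under the other.
Doing the normalization for real takes a library such as ICU or the
Unicode data tables:

    #include <stdio.h>
    #include <string.h>

    /*
     * "café" twice: NFC uses the precomposed U+00E9 (0xC3 0xA9 in
     * UTF-8), NFD uses "e" followed by the combining acute U+0301
     * (0xCC 0x81).
     */
    int
    main(void)
    {
            const char nfc[] = "caf\xc3\xa9";
            const char nfd[] = "cafe\xcc\x81";

            printf("NFC is %zu bytes, NFD is %zu bytes, strcmp() returns %d\n",
                strlen(nfc), strlen(nfd), strcmp(nfc, nfd));
            /* The two never compare equal byte-wise, even though a user
               would call them the same filename. */
            return 0;
    }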

> The engineering discussion is all in the Rob Pike and Ken Thompson
> paper.

I had read that paper before, and found it helpful to review in light
of this discussion.  Thank you for mentioning it.  

--jkl

