tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Sun, Apr 07, 2013 at 10:54:34AM -0400, Mouse wrote:
> 
> It costs quite a lot, actually, because it would mean that everything
> working with pathnames as other than opaque octet strings has to be
> aware of its idiosyncracies, such as the normalization rules, as I
> mentioned above.

No. Did you read Rob Pike and Ken Thompson's paper on UTF-8? They
decided not to take into account how the characters may be rendered,
collating sequences and so on. The base utilities deal with UTF-8 but
do not address these. The "has to be aware of its idiosyncracies" part
is not a requirement but an engineering decision. And the best
solution, IMHO, is to drop it completely.

All in all, as long as filesystems accept 8-bit-clean pathnames, they
are UTF-8 ready without knowing it and without having to know it.
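
To make this concrete, here is a minimal sketch, assuming a
hypothetical helper count_components() (not any real kernel routine).
It splits a pathname on '/' purely by bytes. Since UTF-8 encodes every
non-ASCII character using only bytes with the high bit set, neither
'/' (0x2F) nor NUL can appear inside a multibyte sequence, so the
byte-oriented code needs no decoding:

```c
#include <stddef.h>

/*
 * Count the non-empty components of a pathname by looking at bytes
 * only. Nothing here knows about UTF-8: bytes >= 0x80 are simply
 * "not a slash" and pass through untouched.
 */
size_t count_components(const char *path)
{
    size_t n = 0;
    int in_component = 0;

    for (const unsigned char *p = (const unsigned char *)path; *p; p++) {
        if (*p == '/') {
            in_component = 0;       /* separator ends the component */
        } else if (!in_component) {
            in_component = 1;       /* first byte of a new component */
            n++;
        }
    }
    return n;
}
```

For instance, count_components("/home/\xc3\xa9t\xc3\xa9") sees two
components, exactly as it would for an ASCII-only path.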

If something had to be done, it would be to modify the string-matching
base utilities to handle UTF-8 (the Plan 9 paper covers this too, and
the Plan 9 implementations are available as a reference).
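
As an illustration of how little some matching code has to change,
here is a sketch with a hypothetical helper find_bytes() built on the
standard strstr(). It assumes both arguments are valid UTF-8. Lead
bytes and continuation bytes occupy disjoint ranges in UTF-8, so a
valid needle can only match on a character boundary of a valid
haystack, and a plain byte search gives a correct answer. This is only
a sketch of the idea, not the Plan 9 code:

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the byte offset of needle in haystack, or -1 if it does not
 * occur. Assumption: both strings are valid UTF-8, in which case the
 * byte-wise match cannot start or end in the middle of a character.
 */
long find_bytes(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);

    return hit ? (long)(hit - haystack) : -1;
}
```

Searching for U+00EF ("\xc3\xaf") in "na\xc3\xafve" yields offset 2,
while the same search in the ASCII string "naive" yields -1.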

What I don't understand (but I may have misunderstood the subject,
since I'm not a native English speaker) is a proposal to put encoding
considerations in the filesystems or in the file-handling system
calls. "Octet strings are enough for everybody, thanks to UTF-8 if one
wants it". And this happens to be what is already done at this level:
C strings.

The locales are already a nightmare. I fail to see the need to spread
the disease, that is, a proliferation of "encodings", with character
strings no longer handled as octet strings and with varying sizes,
endianness and so on. And I hate case-insensitive filesystems, so I
don't want some cryptic policy enforced that decides that two distinct
strings passed in are, in fact, the same thing: this is a user
decision, not a system one. If two distinct strings are indeed exactly
the same thing, there should be only one way to write them. If there
are subtleties, this means that they are different. One or the other.

Perhaps the problem is unclear wording: UTF-8 is for user-space
utilities. As far as the kernel is concerned, it does not have to
"know" that this is UTF-8, since it performs no acrobatics on it and
takes it as a C string, and that's all. Two filenames are distinct
when strcmp(a, b) != 0, so this is still an opaque octet string, and I
wouldn't be happy to discover that this is no longer the case.
It's a user-level problem to take what the user gives and normalize it
according to whatever policy one wants to implement. It is not even
the problem of the system: these would be "pkgsrc" options, so to
speak.

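A minimal sketch of that rule, using a hypothetical helper
same_file_name() that is nothing more than strcmp(). The two sample
names are my own illustration: both render as "café", one with the
precomposed U+00E9, the other with "e" followed by the combining acute
accent U+0301. Byte-wise they differ, so under the rule above they are
two distinct filenames, and unifying them is left to whatever
user-level policy one chooses:

```c
#include <string.h>

/* "café" with precomposed U+00E9 (NFC form), UTF-8 encoded. */
const char name_nfc[] = "caf\xc3\xa9";

/* "café" as "e" + combining U+0301 (NFD form), UTF-8 encoded. */
const char name_nfd[] = "cafe\xcc\x81";

/*
 * Two names designate the same entry only if their bytes are equal.
 * No normalization, no case folding, no locale.
 */
int same_file_name(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_file_name(name_nfc, name_nfd) is 0: the first name is 5
bytes long and the second is 6.
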
That base utilities, specifically string-handling utilities, be UTF-8
aware is a distinct problem. Extending them without trying to address
the corner cases would already be a vast improvement (a pure
improvement, since the present behavior on ASCII input stays as is;
this is the purpose of UTF-8).
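
A sketch of what "UTF-8 aware without the corner cases" could look
like for a character count, with a hypothetical helper utf8_count().
It assumes valid UTF-8 input and simply counts the bytes that are not
continuation bytes (10xxxxxx). It counts code points, not what a user
perceives as characters, and makes no attempt at normalization:

```c
#include <stddef.h>

/*
 * Count code points in a NUL-terminated string assumed to be valid
 * UTF-8: every byte that is not of the form 10xxxxxx starts a new
 * code point. Pure ASCII input gives the same result as strlen().
 */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With this, "abc" counts as 3, "caf\xc3\xa9" (precomposed) as 4, and
"cafe\xcc\x81" (decomposed) as 5. The last result is the kind of
corner case deliberately left alone.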

The engineering discussion is all in Rob Pike and Ken Thompson's
paper.
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

