tech-userlevel archive

Re: A draft for a multibyte and multi-codepoint C string interface



On Tue, Apr 02, 2013 at 05:31:03PM +0200, Steffen Daode Nurpmeso wrote:
> 
> So, for this, some locale-dependent pre/after parser is or would
> be necessary.

But doesn't this all mean that this handling is done at the user level?

That a filename is just a user's means of referring to some sequential
set of data.

For this filename to be human readable, it has to be interpreted with a
mapping to glyphs.

That this means that the filename, as far as the kernel is concerned,
should be in a universal encoding, without making assumptions about
the programs.

That UTF-8 is the answer, since it allows programs to keep using C
"char" (at least an octet, signed or unsigned).

So the kernel interface should take and give UTF-8, and filesystem
drivers should take and give UTF-8, with user level utilities
converting from the current encoding to Unicode and UTF-8.
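
To illustrate what such a user level utility could do, here is a minimal
sketch assuming the "current encoding" is ISO-8859-1. The function name
latin1_to_utf8 and its interface are hypothetical, made up for this
example; a real tool would more likely go through iconv(3) to handle
arbitrary source encodings.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: convert a NUL-terminated
 * ISO-8859-1 string to UTF-8. Returns the number of bytes written
 * (excluding the terminating NUL), or (size_t)-1 if dst is too small.
 */
size_t
latin1_to_utf8(const char *src, char *dst, size_t dstsize)
{
	size_t n = 0;

	assert(src != NULL && dst != NULL);
	if (dstsize == 0)
		return (size_t)-1;

	for (; *src != '\0'; src++) {
		unsigned char c = (unsigned char)*src;

		if (c < 0x80) {
			/* ASCII: copied unchanged, one byte. */
			if (n + 1 >= dstsize)
				return (size_t)-1;
			dst[n++] = (char)c;
		} else {
			/* U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx. */
			if (n + 2 >= dstsize)
				return (size_t)-1;
			dst[n++] = (char)(0xC0 | (c >> 6));
			dst[n++] = (char)(0x80 | (c & 0x3F));
		}
	}
	dst[n] = '\0';
	return n;
}
```

With this, the ISO-8859-1 name "caf\xE9" becomes the UTF-8 bytes
63 61 66 C3 A9 before it is handed to the kernel.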

But that's all. If a user really wants to take into account acrobatics
about collating sequences and the like, he can use or develop a program
to do so.

But as far as the kernel and the drivers are concerned, a filename is
uniquely defined by a C char string (happening to be UTF-8).

UTF-8 has the same role as UTC time. There is one and only one canonical
representation, fixed. And the display of the information is customized
according to user level rules.

UTF-8 has the properties (it was designed for that) of being a
historically compatible encoding (C char strings), with a limited
impact on size for current names (ASCII; even in France, I rarely see
accented characters in filenames), and without limitation on what
can be encoded.
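
The size argument can be checked with a small sketch. The helper names
utf8_codepoints and utf8_overhead are hypothetical, and the code assumes
its input is already valid UTF-8: it counts code points by skipping
continuation bytes (those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: number of code points in a valid UTF-8 string.
 * Every byte that is not a continuation byte starts a code point.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Hypothetical helper: extra bytes that UTF-8 costs compared to one
 * byte per character. Zero for a pure ASCII name.
 */
size_t
utf8_overhead(const char *s)
{
	return strlen(s) - utf8_codepoints(s);
}
```

A pure ASCII name such as "Makefile" has an overhead of zero bytes; a
name with two accented Latin letters costs two extra bytes.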

What I don't get is why the kernel should be plagued with user level or
typographic manipulation. 

Even an 8bit clean implementation can be "converted" to UTF-8 since it
is just a matter of convention: it can do UTF-8 without even knowing...

UTF-8, I can see. Unicode with 16 bits or 24 bits, I personally
don't want to be trapped into using. And, at the kernel or driver
level, name equality is defined as strcmp(a, b) == 0 (if a user
wants something else at user level, that's his problem and his
choice).
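
A minimal sketch of that rule follows; name_equal is a hypothetical
wrapper, not an existing kernel function. It also shows the kind of
case left to user level: "é" written as the single code point U+00E9
and "é" written as "e" followed by the combining acute accent U+0301
are different byte strings, hence different names at this level.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical wrapper: two names are the same name if and only if
 * they are the same byte string. No case folding, no normalization,
 * no locale.
 */
int
name_equal(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	return strcmp(a, b) == 0;
}
```

Here name_equal("caf\xC3\xA9", "caf\xC3\xA9") is true, while
name_equal("caf\xC3\xA9", "cafe\xCC\x81") is false even though both
may display identically. A user who wants them treated as equal
normalizes at user level before calling into the kernel.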

-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

