tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



tlaronde%polynum.com@localhost wrote:
 |All in all, as long as filesystems accept 8bits clean pathnames they
 |are UTF-8 ready without knowing and without having to know.

Just in case you also refer to me, i never talked about
filesystems; i never dealt with anything but userspace.

 |All in all, if something had to be done, it would be modifying base
 |utilities to handle UTF-8 for string matching utilities (the Plan9 paper
 |too; and there are Plan9 implementations as reference).

Of course, all utilities that are affected by LC_CTYPE and
LC_COLLATE require, or may require, an update.

I think Tom Christiansen has implemented a fully Unicode-aware set
of these utilities in Perl(1); i haven't looked at those yet, but
i think they have to support graphemes, since that affects
character and word counts etc.
I.e., this task cannot be done correctly with the *w* family.
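
To illustrate the grapheme point, here is a minimal sketch (the helper
name utf8_codepoints is made up for this example, and valid UTF-8 input
is assumed).  It counts code points, which is roughly the best a
per-wchar_t view can offer on systems where wchar_t holds a full code
point; a user-perceived character may still consist of several of them.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the code points of a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (bit pattern 10xxxxxx).  Assumes the input is valid UTF-8.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; ++s)
		if (((unsigned char)*s & 0xC0) != 0x80)
			++n;
	return n;
}
```

For "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 65 CC 81) this
yields two code points, although a reader sees one character; the
precomposed U+00E9 (bytes C3 A9) yields one.  A character count that is
meant to match what the user sees needs grapheme cluster segmentation
on top of this.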

Btw., his Unicode-Tussle tools are nice to play with regarding
Unicode.

 [.]
 |The locales are already a nightmare, I fail to see the need for a
 |propagation of the disease, that is a demultiplication of "encodings",
 |with strings of characters being not dealt as octets strings, with
 |varying size, endianness and so on. And I hate case insensitive

Logically one cannot decouple charsets from locales since many
aspects are really locale specific.  Especially collation cannot
be decoupled afaik; i.e., German, Austrian and Swiss German have
differences, also in sorting (but please don't ask).

However, most properties of characters are no longer bundled with
a locale, *if* Unicode properties can be used to classify
character data.  This is the basis for dealing with multi-language
documents -- though this is high level; however, one can say
iconv(CHARSET), and then deal with that charset.  No, you don't
need to say iconv("ISO 639[_ISO 3166].CHARSET") for that.
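
As a sketch of what "say iconv(CHARSET)" looks like in practice: the
POSIX iconv_open() call takes nothing but two charset names, with no
language or territory involved.  The wrapper name convert_charset is
made up for this example, the POSIX prototype of iconv() (char ** for
the input) is assumed, and which charset names are accepted depends on
the implementation.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical wrapper: convert inlen bytes of `in' from charset
 * `from' to charset `to', writing at most outcap bytes to `out'.
 * Returns the number of bytes written, or (size_t)-1 on any failure
 * (unknown charset name, invalid input, output buffer too small).
 */
size_t
convert_charset(const char *to, const char *from,
    const char *in, size_t inlen, char *out, size_t outcap)
{
	iconv_t cd;
	char *inp = (char *)in;	/* POSIX iconv() wants a non-const char ** */
	char *outp = out;
	size_t inleft = inlen, outleft = outcap;
	size_t rv;

	assert(in != NULL && out != NULL);

	/* Only charset names are needed here, no locale name. */
	cd = iconv_open(to, from);
	if (cd == (iconv_t)-1)
		return (size_t)-1;

	rv = iconv(cd, &inp, &inleft, &outp, &outleft);
	/* Flush any pending shift sequence of a stateful target charset. */
	if (rv != (size_t)-1)
		rv = iconv(cd, NULL, NULL, &outp, &outleft);
	iconv_close(cd);

	if (rv == (size_t)-1)
		return (size_t)-1;
	return outcap - outleft;
}
```

With the common names "ISO-8859-1" and "UTF-8", converting the single
byte E9 should produce the two bytes C3 A9.  On some systems the
program has to be linked with -liconv.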

What i would really like to have would be some kind of interface
so that you can say xy(CHARSET) and you get an object, not
a locale_t, but something more restricted.  You cannot do
collation compare, and possibly cmp() could only be expressed as
is_equal(), but anything else would be possible, in practice.
And that would be nice, because when exactly do you have
the complete locale description that is necessary for newlocale()
etc. at hand?

 |Perhaps the problem is unclear wording: UTF-8 is for user space
 |utilities. As far as the kernel is concerned, it does not have to "know"
 [.]
 |That base utilities, specifically string handling utilities, be UTF-8
 |aware is a distinct problem. And extending them without trying to

aah, you did say it.

--steffen

