Re: A draft for a multibyte and multi-codepoint C string interface



>> I'm talking about the way 0x00 and 0x2f octets are special in
>> pathnames at the syscall interface.  This is annoying to
>> applications that want to name files with fundamentally
>> non-character-string data.  The live use case I mentioned a message
>> or two ago is an example - that really _wants_ to name files with
>> time_t values (actually, time_t plus a disambiguator serial number).
> I think you will admit that your example of "binary" filenames is not
> a common use.

Of course.  But, since it's not (currently or historically) supported,
its rarity means nothing.  Anyone with such a desire _has_ to encode
the name at least a little; given that, most-to-all of them probably
do basically what I did: encode the values in a moderately low base
and store the resulting digits as octet values known to turn into
relatively convenient (for the relevant humans) characters when hit
with tools that insist on interpreting them as characters.
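
For concreteness, here is a minimal sketch of that sort of encoding -
the format and the helper name are purely illustrative, not anything
the real code is claimed to use:

#include <stdio.h>
#include <time.h>

/* Encode a time_t plus a disambiguator serial as a filename made
 * only of the octets [0-9a-f] and '.', none of which can collide
 * with the special 0x00 and 0x2f octets. */
static void
mkname(char *buf, size_t len, time_t when, unsigned int serial)
{
	snprintf(buf, len, "%016llx.%04x",
	    (unsigned long long)when, serial);
}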

> This means that to use these filenames, the utilities have to know
> about the convention.

Of course.  Anything that deals with information stored in filenames
(type 2) - as opposed to treating them as opaque strings, whether of
octets or of characters - has to know how it's stored.
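
For instance, a utility consuming names like the sketch above would
have to undo the encoding itself - again, purely illustrative:

#include <stdlib.h>
#include <time.h>

/* Parse a name of the form <hex time_t>.<hex serial> back into
 * its components; returns 0 on success, -1 on a malformed name. */
static int
parsename(const char *name, time_t *when, unsigned int *serial)
{
	char *dot;
	unsigned long long t;

	t = strtoull(name, &dot, 16);
	if (*dot != '.')
		return (-1);
	*when = (time_t)t;
	*serial = (unsigned int)strtoul(dot + 1, NULL, 16);
	return (0);
}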

> A "binary" filename can be represented without ado by an hexadecimal
> (or whatever else base) string that is full ASCII, that is
> identically UTF-8, that is already possible now without changing
> everything (and if one really wants, the binary filenames can have a
> suffix giving the base: 0x...).

Sure.  It's a workaround - basically, what I said above: encode in a
low base (16) and then store the resulting digits as octet values
known to be convenient for humans.  But the existence of a workaround
doesn't change the fundamental desire.  (Nitpick: that's actually a
prefix, not a suffix.)

> The solution is mainly in userspace, not at the kernel level,
> because UTF-8 allows "more", allows a hexadecimal encoding that
> covers everything, and is compatible with all utilities expecting
> strings.

It's not, though.  It means that everything that handles pathnames as
other than opaque octet strings has to be aware of the normalization
rules, if only to avoid inadvertently generating two names which clash.

It's the problems generated by case-insensitive filesystems all over
again, only worse because the mapping is significantly more
complicated, because octet sequences of different lengths are
"identical", and because it's more of a moving target (admittedly not
and because it's more of a moving target (admittedly not much more).
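
A concrete illustration: the precomposed and decomposed spellings of
the same human-visible name are distinct octet strings, of different
lengths, at the syscall interface (the bytes below are real UTF-8;
everything else is illustrative):

#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* "café": precomposed U+00E9 vs plain 'e' followed by
	 * combining U+0301; both render identically. */
	static const char nfc[] = "caf\xc3\xa9";
	static const char nfd[] = "cafe\xcc\x81";

	printf("equal octet strings? %s\n",
	    strcmp(nfc, nfd) == 0 ? "yes" : "no");
	printf("lengths: %zu vs %zu\n", strlen(nfc), strlen(nfd));
	return (0);
}

This prints "no" and "5 vs 6": open(2) on one spelling will not find
a file created under the other unless something, somewhere, carries
the normalization tables around and applies them.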

> But UTF-8 is the most interesting solution,

Perhaps to you.

> because it allows existing names, allows not-yet-existing ones, and
> does not cost a lot of modifications to the existing code base.

It costs quite a lot, actually, because it would mean that everything
working with pathnames as other than opaque octet strings has to be
aware of its idiosyncrasies, such as the normalization rules, as I
mentioned above.  This would be a significant burden even if the only
encoding ever used were UTF-8, which definitely is not the case.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

