tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On 2013-04-06 23:47Z, Mouse wrote:
> I'm talking about the way 0x00 and 0x2f octets are special in pathnames
> at the syscall interface.  [...]
> After all, there's no reason d_name[] has to have any forbidden octet
> values (to pick the filesystem I know best), and, indeed, non-Unix
> systems speaking to Unix NFS servers have been known to create
> directory entries which (in Unix terms) have slashes in their d_name
> strings, much to the consternation of people trying to work with them
> from the Unix side.  (I've never heard of the analogous problem with
> 0x00, probably at least in part because the NFS clients tend to use
> C-style strings too, probably at least in part because the NFS servers
> _also_ tend to use C strings.)

Indeed, encoding all filenames as UTF-8 might allow "clever" engineers
to design "solutions" around that "problem", thanks to the properties of
the UTF-8 encoding: U+002F could be encoded as the (malformed, overlong)
sequence \xC0\xAF, and U+0000 as the (equally malformed) \xC0\x80.
I am not sure you would like it.
Certainly Microsoft engineers and sysadmins at large did not like that
idea when it was used and abused to exploit their systems.
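
To make the trick concrete, here is a minimal sketch in C. The function
names are mine and purely illustrative; they are not taken from any
existing interface or from the draft under discussion. It only handles
two-byte sequences, which is enough to show the point: an octet-level
check for 0x2f sees nothing, a lax decoder happily yields U+002F, and a
strict decoder has to reject the sequence explicitly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lax decoder: accepts any two-byte pattern 110xxxxx 10xxxxxx
 * and returns the code point, without checking for overlong forms.
 * Returns -1 if the octets do not match the two-byte pattern. */
static long naive_decode2(const unsigned char *s)
{
    if ((s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return -1;
    return ((long)(s[0] & 0x1F) << 6) | (long)(s[1] & 0x3F);
}

/* Hypothetical strict decoder: a two-byte sequence must encode at least
 * U+0080, so anything decoding below that is overlong and rejected. */
static long strict_decode2(const unsigned char *s)
{
    long cp = naive_decode2(s);
    if (cp < 0x80)
        return -1;
    return cp;
}

/* The octet-level view: is there a literal 0x2f octet in the n-octet
 * buffer? This is all that a check on raw octets can see. */
static int has_slash_octet(const unsigned char *s, size_t n)
{
    return memchr(s, 0x2F, n) != NULL;
}
```

Any code that filters on octets first and decodes leniently afterwards
lets the slash through, which is exactly the class of hack I mean.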

The point here is that while UNIX "octet" strings forbid only two
well-defined octet values (0x00 and 0x2f), using UTF-8 opens a large
door for that class of hacks; and of course NFD's door is even bigger,
since every additional rule makes it harder to get right in every
implementation.


Antoine

