tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface

On Tue, 2 Apr 2013 18:08:01 +0200 wrote:

> UTF-8 has the same role as UTC time. There is one and only one
> canonical representation, fixed. And the display of the information
> is customized according to user level rules.

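UTC is a simpler problem.  With Unicode text encoded as UTF-8, the same
string of characters may be represented by more than one sequence of
bytes: a precomposed character and its base-letter-plus-combining-mark
form are canonically equivalent but encode differently.  And, while
NetBSD may prevent non-canonical sequences in filenames, it must be able
to mount and cope with filesystems that were not so carefully managed by
other systems.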
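
To make the point concrete, here is a minimal sketch in Python (chosen
only because its standard library ships Unicode normalization; nothing
here is NetBSD code).  It shows one visible character with two different
UTF-8 byte sequences.  The variable names are made up for illustration.

```python
import unicodedata

# The same visible character, written two canonically equivalent ways.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

composed_bytes = composed.encode("utf-8")      # b'\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")  # b'e\xcc\x81'

# A byte-for-byte comparison treats them as different names.
bytes_differ = composed_bytes != decomposed_bytes

# After normalizing both to the same form (NFC here), they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(composed_bytes, decomposed_bytes, bytes_differ, same_after_nfc)
```

Both byte sequences are well-formed UTF-8, so validating the encoding
alone does not make the two names equal.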

> So that the kernel interface should take and give UTF-8, and that 
> filesystem drivers should take and give UTF-8, user level utilities
> converting from the current encoding to unicode and UTF-8.
> But that's all. If one user really wants to take into account
> acrobatics about collating sequences and the like, he can use/develop
> a program to do so.

You can't fob it off to userspace.  At least I don't think so.  

Consider open(2).  Every element in the pathname needs
canonicalization.  OK, userspace can do that.  But what if the
filesystem doesn't conform?  Say, because it's a CD-ROM, or a camera,
never mind NFS/sshfs/samba/puffs.

ISTM that to open a file, the kernel needs a more sophisticated
definition of string equality than a byte-for-byte comparison.  At the
very least, it has to be able to canonicalize extant names on the disk,
and to deal somehow with duplicates.  
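
As a rough illustration of what such a comparison might look like, here
is a sketch in Python, not a proposal for kernel code.  It assumes NFC
as the canonical form (other systems pick differently), and the names
canonical_key and lookup are hypothetical.  Names that are not valid
UTF-8 fall back to their raw bytes, which is one possible policy among
several.

```python
import unicodedata

def canonical_key(name: bytes) -> bytes:
    """Return a comparison key for an on-disk name.

    Valid UTF-8 is normalized to NFC (an assumption for this sketch);
    anything else is compared by its raw bytes.
    """
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return name
    return unicodedata.normalize("NFC", text).encode("utf-8")

def lookup(entries: list[bytes], wanted: bytes) -> list[bytes]:
    """Return every directory entry canonically equal to `wanted`."""
    key = canonical_key(wanted)
    return [entry for entry in entries if canonical_key(entry) == key]

# A directory written by a less careful system: two entries that differ
# byte-for-byte but are canonically the same name.
on_disk = [b"caf\xc3\xa9.txt", b"cafe\xcc\x81.txt", b"notes.txt"]

matches = lookup(on_disk, "caf\u00e9.txt".encode("utf-8"))
print(matches)
```

Here the lookup returns two entries for one requested name.  That is the
duplicate case: the sketch only detects it, and which file open(2)
should then return remains a policy question.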

