tech-kern: Re: Unicode support in kernel

Subject: Re: Unicode support in kernel
To: Noriyuki Soda <soda@sra.co.jp>
From: Jaromir Dolecek <dolecek@ics.muni.cz>
List: tech-kern
Date: 10/15/1999 10:38:08

Noriyuki Soda wrote:
> >   Why do you need Unicode support in the kernel?  The tables are a  
> > lot of crap to stow in the kernel.

Stowing them in the kernel (where they can be shared) is much more
economical than the userland approach, where every process has its
own copy (though a cleverly structured definition file would make
it possible to mmap() it).  Furthermore, the checks (to see whether
translating a filename to/from UTF-8 is really necessary) are very
cheap in the kernel, while they can be quite costly from userland.
Last but not least, you don't need to care about backwards
compatibility so much when everything is in the kernel.
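
To illustrate why such a check can be cheap, here is a minimal sketch
(userland-style C, not code from our tree). It assumes the process
codeset is ASCII-compatible, so a name made only of 7-bit bytes is
identical in that codeset and in UTF-8 and needs no translation. The
name name_needs_recoding() is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the NetBSD tree.
 * Returns 1 if the name contains any byte >= 0x80 and therefore may
 * need recoding, 0 if it is pure 7-bit ASCII.
 * Assumption: the codeset in use is ASCII-compatible.
 */
static int
name_needs_recoding(const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		/* Any byte with the high bit set is outside ASCII. */
		if ((unsigned char)name[i] & 0x80)
			return 1;
	}
	return 0;
}
```

A caller would run this once per name and skip the table lookup
entirely when it returns 0, which should be the common case.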

Some way could be devised to upload the tables into the kernel at
run time. Each process would then have just an attribute telling
which encoding or charset it uses. That wouldn't be too hard to do,
and we would probably have the best of both worlds.

However, this would need many more structural changes than
the original proposal. I don't want to do it with 1.5 around
the corner :-/

> Yes. I'd like to avoid it, but...
> 
> At least, Long filename extension to MS-DOS FAT filesystem requires
> Unicode support in kernel, since it encodes filename as both UCS-2 and
> codepage-dependent-codeset. Thus, for example, if userland specified
> a filename as Shift_JIS, kernel has to translate it to Unicode, and
> the reverse is also true.

Really? I didn't know that. So even Windows 95 stores the long
filenames on FAT fs in Unicode too? But there is no mention of Unicode
in our sys/msdosfs/ AFAICS :(
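
For reference, the Unicode-to-UTF-8 half of such a translation is
small. Below is a minimal sketch, assuming the long name has already
been read from the directory entries as 16-bit UCS-2 units in host
byte order. The function name ucs2_to_utf8() is hypothetical, and
nothing here is taken from sys/msdosfs/.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one UCS-2 code unit as UTF-8.
 * dst must have room for 3 bytes.  Returns the number of bytes
 * written (1 to 3).  Surrogate values (0xd800-0xdfff) get no special
 * treatment, since plain UCS-2 has no surrogate pairs.
 */
static size_t
ucs2_to_utf8(uint16_t c, unsigned char *dst)
{
	if (c < 0x80) {
		/* ASCII range: one byte, unchanged. */
		dst[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		/* 11 bits: 110xxxxx 10xxxxxx */
		dst[0] = (unsigned char)(0xc0 | (c >> 6));
		dst[1] = (unsigned char)(0x80 | (c & 0x3f));
		return 2;
	}
	/* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
	dst[0] = (unsigned char)(0xe0 | (c >> 12));
	dst[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
	dst[2] = (unsigned char)(0x80 | (c & 0x3f));
	return 3;
}
```

Going from UTF-8 to a codeset like Shift_JIS is the part that needs
the big tables; this direction needs none.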

The other thing I thought of as a cool feature to have is
a way to have the filenames translated from an arbitrary
codeset to the codeset of the system (or process). You would
just specify the native codeset of the mounted volume
at mount time and the filenames would be translated
transparently. *daydream* ;-)
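
A rough sketch of what a per-mount table could look like, for 8-bit
codesets only (multibyte codesets such as Shift_JIS would need more
than this). Everything here is hypothetical: struct codeset_map,
codeset_init_latin1() and codeset_to_ucs2() are names made up for
illustration. Bytes below 0x80 are assumed to be ASCII, and ISO 8859-1
serves as the example because each of its bytes equals its Unicode
code point.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-mount codeset description: maps bytes 0x80..0xff
 * of an 8-bit codeset to UCS-2.  Bytes below 0x80 are assumed ASCII.
 */
struct codeset_map {
	uint16_t high[128];
};

/* Fill the map for ISO 8859-1, where byte value == code point. */
static void
codeset_init_latin1(struct codeset_map *m)
{
	size_t i;

	for (i = 0; i < 128; i++)
		m->high[i] = (uint16_t)(0x80 + i);
}

/*
 * Translate a name from the volume codeset to UCS-2 units.
 * Converts at most dstlen bytes and returns the number of units
 * written.
 */
static size_t
codeset_to_ucs2(const struct codeset_map *m, const unsigned char *src,
    size_t len, uint16_t *dst, size_t dstlen)
{
	size_t i, n;

	n = len < dstlen ? len : dstlen;
	for (i = 0; i < n; i++) {
		if (src[i] < 0x80)
			dst[i] = src[i];		/* ASCII passes through */
		else
			dst[i] = m->high[src[i] - 0x80];	/* table lookup */
	}
	return n;
}
```

The mount option would then only have to select which table gets
attached to the mount point.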

So my conclusion:
The approach with just a mount option telling the codeset the
Unicode filenames should be recoded into has its shortcomings,
but it is much better than what we have now - or, better said, what
we don't have now ;-) It may be somewhat crude, but it seems sufficient
to be used until a "proper" solution is thought out and implemented.
If no one objects strongly in the near future, I'll finish what I have
now and commit it to the tree.

Jaromir
-- 
Jaromir Dolecek <jdolecek@NetBSD.org>      http://www.ics.muni.cz/~dolecek/
"The only way how to get rid temptation is to yield to it." -- Oscar Wilde