Subject: Re: Unicode support in kernel
To: Noriyuki Soda <soda@sra.co.jp>
From: Bill Studenmund <wrstuden@nas.nasa.gov>
List: tech-kern
Date: 10/14/1999 10:19:58
On Thu, 14 Oct 1999, Noriyuki Soda wrote:

> > Hi,
> > since ntfs pretty much needs some Unicode support in kernel, I'm
> > going to integrate code heavily hacked from Motomichi Matsuzaki's
> > patches for FreeBSD Joliet Unicode support (his original patches
> > are available on http://triaez.kaisei.org/~mzaki/joliet/). The code
> > will be shared by both cd9660 & ntfs and be available for any other
> > filesystem to use.
> 
> Be careful!
> The translation tables between existing character sets and Unicode
> that are used by Windows filesystems differ from the tables
> defined by the Unicode Consortium. (Thus, the kernel would have to
> handle several conversion tables between Unicode and other
> character sets, since Apple's conversion tables are also different
> from Microsoft's or the Unicode Consortium's.)

Apple actually has a lot of them. In HFS+, there is a bit field which
records which of Apple's translations have been used to generate Unicode
names. That way reverse mappings (from Unicode) are more likely to work,
and the system only needs to load the reverse mapping routines for the
code sets which have actually been used.
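
The bit-field idea can be sketched roughly as follows. This is only an
illustration, assuming a 64-bit bitmap and made-up encoding numbers; the
struct, enum, and function names are hypothetical and are not taken from
Apple's or anyone else's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding IDs; a real filesystem assigns its own numbering. */
enum text_encoding {
	ENC_ROMAN = 0,
	ENC_JAPANESE = 1,
	ENC_KOREAN = 3,
	ENC_MAX = 64		/* one bit per encoding in a 64-bit word */
};

/* Per-volume record of which encodings were used to create names. */
struct volume_encodings {
	uint64_t used_bitmap;
};

/* Record that a name on this volume was converted from encoding `enc`. */
static int
volenc_mark_used(struct volume_encodings *ve, unsigned enc)
{
	if (enc >= ENC_MAX)
		return -1;
	ve->used_bitmap |= (uint64_t)1 << enc;
	return 0;
}

/* Return 1 if `enc` has been used on this volume, 0 otherwise. */
static int
volenc_is_used(const struct volume_encodings *ve, unsigned enc)
{
	if (enc >= ENC_MAX)
		return 0;
	return (int)((ve->used_bitmap >> enc) & 1);
}

/*
 * Call `load` once for every encoding marked as used, so that only the
 * reverse mappings that are actually needed get loaded.  Returns the
 * number of encodings found.
 */
static int
volenc_load_reverse_maps(const struct volume_encodings *ve,
    void (*load)(unsigned enc, void *arg), void *arg)
{
	int n = 0;

	for (unsigned enc = 0; enc < ENC_MAX; enc++) {
		if (volenc_is_used(ve, enc)) {
			if (load != NULL)
				load(enc, arg);
			n++;
		}
	}
	return n;
}
```

The point of the design is that the cost of reverse mapping scales with
the encodings a volume has actually seen, not with every table the
system knows about.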

> > Currently, the code uses sysctl to set the preferred target
> > encoding. Since this is not very flexible (and would
> > mean that the filename cache would have to be flushed for
> > all filesystems using the recoding engine on any change
> > of the character set), I'm going to make the target
> > character set a mount option.
> 
> A mount option is better, but it is not perfect, because
> - as mentioned above, what happens when the kernel has to convert
>   an EUC-JP filename to ISO-8859-1?
> - think about a multiuser system where one user
>   would like to use ISO-8859-1 and another user would like
>   to use EUC-JP.

Sounds good.
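
To make the mount-option idea concrete, here is a minimal sketch of
per-mount recoding state. Everything in it is an assumption for
illustration: the "charset=" option syntax, the struct, and the function
name are hypothetical and do not correspond to any existing mount code.

```c
#include <stddef.h>
#include <string.h>

#define CHARSET_NAME_MAX 32

/* Hypothetical per-mount recoding state; not an existing kernel structure. */
struct recode_mount {
	char target_charset[CHARSET_NAME_MAX];
};

/*
 * Parse an option of the (assumed) form "charset=NAME" into the mount's
 * state.  Returns 0 on success, -1 if the option is malformed or the
 * name does not fit.
 */
static int
recode_mount_setopt(struct recode_mount *rm, const char *opt)
{
	static const char key[] = "charset=";
	const size_t klen = sizeof(key) - 1;
	const char *val;
	size_t vlen;

	if (strncmp(opt, key, klen) != 0)
		return -1;
	val = opt + klen;
	vlen = strlen(val);
	if (vlen == 0 || vlen >= CHARSET_NAME_MAX)
		return -1;
	memcpy(rm->target_charset, val, vlen + 1);	/* include the NUL */
	return 0;
}
```

Because the target character set lives with the mount, changing it for
one filesystem does not touch the names cached for any other mount. It
does not address the multiuser case, since every user of a given mount
still sees the same encoding.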

> > but somehow I don't like it much - it's too deep
> > in directory structure and the Unicode recoding engine
> > is not strictly usable just for filesystems. I'd better
> > put the general interface .c file to sys/kern/kern_unicode.c,
> > headers would go into sys/sys/ and the other .c
> > files to, say, kern/unicode/ or something like that.
> 
> IMHO, that's not right.
> Because
> - The code to support multiple character sets is needed for the console
>   driver and other things. The library should handle other codesets.
> - ISO-10646 is a better name than Unicode.
> 
> Perhaps /sys/codeset/ or /sys/lib/codeset is a candidate,
> but I think we have to resolve the above problems first.

I like sys/lib/codeset, and we could put all the codeset stuff there.
miscfs/genfs is wrong because more than just filesystems will need this
(though filesystems will be the big users).
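
As a rough sketch of what a codeset-neutral interface in such a
directory might look like, assuming a simple table of converters looked
up by name: the struct, the function names, and the choice of a 32-bit
code point as the pivot are all hypothetical, and only ISO-8859-1 is
filled in.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical converter descriptor; one entry per supported codeset. */
struct codeset_ops {
	const char *name;
	/* Decode one character; returns bytes consumed, 0 on error. */
	size_t (*to_ucs)(const unsigned char *src, size_t len, uint32_t *out);
	/* Encode one character; returns bytes written, 0 if impossible. */
	size_t (*from_ucs)(uint32_t ucs, unsigned char *dst, size_t len);
};

/* ISO-8859-1 maps one-to-one onto the first 256 code points. */
static size_t
latin1_to_ucs(const unsigned char *src, size_t len, uint32_t *out)
{
	if (len < 1)
		return 0;
	*out = src[0];
	return 1;
}

static size_t
latin1_from_ucs(uint32_t ucs, unsigned char *dst, size_t len)
{
	if (len < 1 || ucs > 0xFF)
		return 0;	/* no room, or not representable */
	dst[0] = (unsigned char)ucs;
	return 1;
}

static const struct codeset_ops codeset_table[] = {
	{ "ISO-8859-1", latin1_to_ucs, latin1_from_ucs },
};

/* Find a converter by name; returns NULL if the codeset is unknown. */
static const struct codeset_ops *
codeset_lookup(const char *name)
{
	size_t n = sizeof(codeset_table) / sizeof(codeset_table[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(codeset_table[i].name, name) == 0)
			return &codeset_table[i];
	}
	return NULL;
}
```

A filesystem, the console driver, or anything else would go through
codeset_lookup() and the two function pointers, so none of them needs
to know which tables exist. A from_ucs result of 0 is also where the
"EUC-JP name on an ISO-8859-1 mount" case would surface: the caller has
to decide what to do with a character that cannot be represented.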

Take care,

Bill