tech-kern: Re: Unicode support in kernel

Subject: Re: Unicode support in kernel
To: None <dolecek@ics.muni.cz, tech-kern@netbsd.org>
From: Noriyuki Soda <soda@sra.co.jp>
List: tech-kern
Date: 10/14/1999 19:38:50
> Hi,
> since ntfs pretty much needs some Unicode support in kernel, I'm
> going to integrate code heavily hacked from Motomichi Matsuzaki's
> patches for FreeBSD Joliet Unicode support (his original patches
> are available on http://triaez.kaisei.org/~mzaki/joliet/). The code
> will be shared by both cd9660 & ntfs and be available for any other
> filesystem to use.

Be careful!
The translation tables between existing character sets and Unicode
that Windows filesystems use are different from the tables defined
by the Unicode Consortium. (Thus, the kernel would have to handle
several conversion tables between Unicode and other character sets,
since Apple's conversion tables also differ from Microsoft's and
the Unicode Consortium's.)

> The unicode support will be pulled in when either ntfs or cd9660
> were included into the kernel, or options UNICODE would 
> be in kernel config. The latter is necessary so that
> it would be possible to load cd9660 and ntfs as LKM.

Converting Unicode to another character set is quite problematic.
How does that code handle the case where the kernel has to convert
an EUC-JP filename to ISO-8859-1?
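A minimal userland sketch of that case, using Python's codecs in
place of kernel tables. The recode() function and its "replace"
fallback are assumptions for illustration, not what the patch does.
With a strict conversion the name simply cannot be produced; with a
substitution fallback, two distinct names collapse into the same
byte string, which a directory lookup could not tell apart.

```python
# Two distinct Japanese filenames as they would appear in EUC-JP.
names = ["\u65e5\u672c.txt", "\u6771\u4eac.txt"]  # "Nihon", "Tokyo"
euc_names = [n.encode("euc_jp") for n in names]


def recode(raw, src="euc_jp", dst="iso-8859-1", errors="strict"):
    """Hypothetical recoding step: src bytes -> Unicode -> dst bytes."""
    return raw.decode(src).encode(dst, errors=errors)


# Strict conversion: no ISO-8859-1 representation exists at all.
try:
    recode(euc_names[0])
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print("strict conversion succeeded:", strict_ok)

# Lossy conversion: every unmappable character becomes "?".
lossy = [recode(raw, errors="replace") for raw in euc_names]
print(lossy)
print("names collide:", lossy[0] == lossy[1])
```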

IMHO, for NTFS, the best solution is to leave the conversion to
userland (like emacs's path-coding-system variable).
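As a rough sketch of what "leave it to userland" could look like:
the kernel side returns the name untranslated (NTFS stores names as
UTF-16), and each process converts with its own coding system and
its own policy for unmappable characters. The functions
kernel_readdir() and user_view() are hypothetical stand-ins, not
real interfaces.

```python
def kernel_readdir():
    """Stand-in for the kernel: names are returned as stored on disk."""
    return ["\u65e5\u672c.txt".encode("utf-16-le")]


def user_view(raw_names, coding):
    """Stand-in for a userland program with its own coding system.

    Unmappable characters are shown as \\uXXXX escapes, so the result
    stays unambiguous instead of collapsing to "?".
    """
    return [
        raw.decode("utf-16-le").encode(coding, errors="backslashreplace")
        for raw in raw_names
    ]


raw_names = kernel_readdir()

# Two users on the same machine, each with a different preference.
euc_user = user_view(raw_names, "euc_jp")
latin_user = user_view(raw_names, "iso-8859-1")

print(euc_user)
print(latin_user)
```

Since the choice is made per process, nothing in the kernel has to
be flushed or remounted when a user changes it.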

Unicode on VFAT has a different problem, but it seems your patch
doesn't deal with this.

Also, plain CD9660 is not limited to Unicode; there are CD-ROMs
that are encoded in Shift-JIS and other character sets.

Also, please don't use "options UNICODE"; use
"options CODESET_UNICODE" or something similar, because there are
other character sets that need to be supported for filesystems and
consoles.

> It's possible to specify the character set into which the Unicode
> filenames will be translated. Always available is utf-8 and
> I'll add iso-8859-1 probably. Other currently supported
> encodings are iso-8859-2, koi8-r and euc-jp.

Mmm, this might cause serious problems.

> Currently, the code uses sysctl to set the preferred target
> encoding. Since this is not quite that flexible (and would
> mean that filename cache would have to be flushed for
> all filesystems using the recoding engine on any change
> of the characters set), I'm going to make the target
> character set a mount option.

A mount option is better, but it is not perfect, because
- as mentioned above, what happens when the kernel has to convert
  an EUC-JP filename to ISO-8859-1?
- Think about a multiuser system where one user would like to use
  iso-8859-1 and another user would like to use euc-jp.

> The thing I'm not sure about is where the files should
> go. It's one unicode_subr.c file, two headers (unicode.h
> and unicode_subr.h) and about 6 .c files implementing
> the charset/encodings recoding. The proposals I got so far
> were sys/lib/libunicode (or sys/lib/unicode) and
> sys/miscfs/genfs/. I'd probably go for the latter,
> but somehow I don't like it much - it's too deep
> in directory structure and the Unicode recoding engine
> is not strictly usable just for filesystems. I'd better
> put the general interface .c file to sys/kern/kern_unicode.c,
> headers would go into sys/sys/ and the other .c
> files to, say, kern/unicode/ or something like that.

IMHO, that's not right, because
- Code to support multiple character sets is needed for the console
  driver and other things. The library should handle other codesets
  too.
- iso-10646 is a better name than unicode.
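One way to picture a codeset library that is not Unicode-specific is
a registry of codesets that all convert through ISO 10646 code
points as the pivot. The sketch below is only an illustration in
userland Python; the registry CODESETS and the functions to_ucs(),
from_ucs() and convert() are invented names, and Python's codecs
stand in for the real tables.

```python
# Hypothetical registry: codeset name -> Python codec used as its table.
CODESETS = {
    "iso-8859-1": "iso-8859-1",
    "iso-8859-2": "iso-8859-2",
    "koi8-r": "koi8_r",
    "euc-jp": "euc_jp",
    "utf-8": "utf-8",
}


def to_ucs(raw, codeset):
    """Decode bytes in the given codeset to ISO 10646 code points."""
    return [ord(ch) for ch in raw.decode(CODESETS[codeset])]


def from_ucs(points, codeset):
    """Encode ISO 10646 code points into the given codeset."""
    return "".join(chr(p) for p in points).encode(CODESETS[codeset])


def convert(raw, src, dst):
    """Convert between any two registered codesets via the pivot."""
    return from_ucs(to_ucs(raw, src), dst)


# A filesystem, a console driver, or anything else uses the same calls.
cyrillic = "\u0430\u0431\u0432".encode("koi8_r")
print(to_ucs(cyrillic, "koi8-r"))
print(convert(cyrillic, "koi8-r", "utf-8"))
```

Adding a codeset is then a matter of registering one more table,
and no caller needs to know that Unicode is involved at all.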

Perhaps /sys/codeset/ or /sys/lib/codeset is a candidate,
but I think we have to resolve the above problems first.
--
soda