tech-kern: Re: Unicode support in kernel

Subject: Re: Unicode support in kernel
To: None <dolecek@ics.muni.cz>
From: Noriyuki Soda <soda@sra.co.jp>
List: tech-kern
Date: 10/15/1999 17:58:21
> > It can be done in library level, or perhaps per process codeset
> > attribute in kernel. Thus don't have to change all userland.
> > (BTW, latter is.... mmmmm ;-))
> 
> I thought about it a bit more and doing this per-process attribute
> in kernel would not be actually very hard (at least it doesn't seem
> to be :). Internally, the filenames would be kept in utf-8 and
> on every pass from/to kernel (open(), creat(), getdents() etc.)

Using UTF-8 as the interchange codeset is not right IMHO, because

- There are characters which cannot be represented in Unicode.
  For example, the forthcoming Japanese Industrial Standard for Kanji
  levels 3 and 4 includes such characters.
- It is inefficient. It makes kernel Unicode support mandatory,
  but why do we have to include such support when the codeset used
  in the filesystem and the codeset requested by the user are
  both Shift_JIS?

There is another way to solve this (see below).

> the filename would be recoded to/from the process's preferred vfs
> charset. The recoding might be even done on library level - if
> the preferred encoding would be in environment, that
> would mean just one more system call (to find out if the
> recoding is necessary for this particular filename). Ha, problem -
> what if ntfs volume would be mounted on some ffs directory ? In
> that case, part of the path would need to be recoded and part not.
> So the recoding would have to happen on namei() level :(

It can be done at the library level, if the kernel supplies enough
information about the codeset of each pathname component. But I agree
that it might be done better in the kernel.
But using UTF-8 as the interchange codeset is not right, as mentioned
above. Instead, it is better to always pass the codeset as a parameter,
like below:

(1) Add a variable, which represents the codeset used in pathnames of
  the filesystem, to the msdosfs, cd9660fs and perhaps ufs mount structs.
  If we add this to ufs, this variable should support "no conversion"
  (i.e. the filesystem includes several codesets in its pathnames).
  Make this variable a mount option.
  Note that NTFS should not have this variable, because its codeset
  is always UCS-2.
(2) Add two system calls which set and get the user preferred codeset
  as per-process information.
  This preferred codeset should support a "no conversion" case; the
  filesystem native codeset (i.e. (1)) will be used for each component
  of the pathname in this case.
(3) Add a codeset parameter to vnode operations which take a pathname
  as an input parameter, e.g. vop_lookup, vop_create, ...
  This codeset parameter represents the codeset of the input pathname.
  The filesystem should convert from this codeset to the filesystem
  native codeset.
  The user preferred codeset (i.e. (2)) will be passed as this parameter.
(4) Add a codeset return value to vnode operations which return a
  pathname as an output parameter, e.g. vop_readlink, vop_readdir.
  This return value represents the codeset of the output pathname.
  The filesystem native codeset (i.e. (1)) will be returned as this
  value.
(5) Change namei() to remember the codeset of each pathname
  component, since the codeset returned from VOP_READLINK() might be
  different from the user preferred codeset. Pass the appropriate
  codeset to the corresponding VOP_LOOKUP() call.
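
To make (1) through (4) a bit more concrete, here is a minimal
user-space sketch in C. Every name in it (enum codeset, sketch_mount,
sketch_proc, the sketch_* functions) is hypothetical and invented for
illustration; none of them exist in the NetBSD kernel, and the real
interfaces would have to fit the existing mount, proc and vnode code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical codeset identifiers (assumption, not a NetBSD API). */
enum codeset {
	CS_NONE = 0,	/* "no conversion": pass pathname bytes through */
	CS_ISO8859_1,
	CS_EUC_JP,
	CS_SHIFT_JIS,
	CS_UCS2
};

/* (1) Per-mount filesystem codeset, set from a mount option. */
struct sketch_mount {
	enum codeset fs_codeset;
};

/* (2) Per-process user preferred codeset. */
struct sketch_proc {
	enum codeset pref_codeset;
};

/* (2) Stand-ins for the two proposed system calls. */
int
sketch_setcodeset(struct sketch_proc *p, enum codeset cs)
{
	assert(p != NULL);
	p->pref_codeset = cs;
	return 0;
}

enum codeset
sketch_getcodeset(const struct sketch_proc *p)
{
	assert(p != NULL);
	return p->pref_codeset;
}

/*
 * (3) What a lookup-style operation would check first: conversion is
 * needed only when both sides name a codeset and the two differ.
 */
int
sketch_needs_conversion(enum codeset input, enum codeset fs_native)
{
	if (input == CS_NONE || fs_native == CS_NONE)
		return 0;
	return input != fs_native;
}

/* (4) Codeset reported together with an output pathname. */
enum codeset
sketch_output_codeset(const struct sketch_mount *mp)
{
	assert(mp != NULL);
	return mp->fs_codeset;
}
```

With a Shift_JIS msdosfs mount and a process whose preferred codeset is
also Shift_JIS, sketch_needs_conversion() returns 0, so no conversion
code (Unicode or otherwise) is touched on that path.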

In this way,
- characters which cannot be represented in Unicode can be used
  (i.e. we provide mechanism, not policy).
- it is efficient: code conversion doesn't happen if it is not needed.

One open issue is how to treat the case where VOP_READLINK() returns
the "no conversion" codeset type (i.e. no codeset is specified in the
mount options of the ufs underlying the symbolic link). This should not
be treated as "no conversion", but as the user preferred codeset, to
meet the user's requirement. But this means there is a possibility that
the symbolic link is treated differently depending on the user's
preferred codeset. Mmmm, but if a codeset is specified in the mount
options of the filesystem, this problem disappears.
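
The symlink rule above, as a small sketch in C. Again, all names
(enum codeset, sketch_symlink_codeset, sketch_tag_components) are
hypothetical; this only illustrates the decision namei() would make,
not how namei() is actually structured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical codeset identifiers (assumption, not a NetBSD API). */
enum codeset {
	CS_NONE = 0,	/* "no conversion" */
	CS_ISO8859_1,
	CS_EUC_JP,
	CS_SHIFT_JIS,
	CS_UCS2
};

/*
 * Codeset to assume for a symlink target. If the readlink operation
 * reported "no conversion" (no codeset mount option on the underlying
 * filesystem), fall back to the user preferred codeset; otherwise
 * trust what the filesystem reported.
 */
enum codeset
sketch_symlink_codeset(enum codeset from_readlink, enum codeset user_pref)
{
	if (from_readlink == CS_NONE)
		return user_pref;
	return from_readlink;
}

/*
 * Tag each of the n pathname components that came out of a symlink
 * target with the codeset to pass to the following lookup calls.
 */
void
sketch_tag_components(enum codeset *tags, size_t n,
    enum codeset from_readlink, enum codeset user_pref)
{
	size_t i;

	assert(tags != NULL || n == 0);
	for (i = 0; i < n; i++)
		tags[i] = sketch_symlink_codeset(from_readlink, user_pref);
}
```

Note that when the filesystem reports CS_NONE, the result depends on
user_pref, which is exactly the "same symlink, different treatment per
user" problem described above.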

> If done right, similar mechanism could be quite easily extended
> to other filesystems (most importantly, ffs). The filenames would be
> stored in utf-8 and recoded appropriately on-the-fly. The performance
> hit should not be very bad.
  
It will not be bad for ISO-8859-1, but it will be bad for EUC-JP and
Shift_JIS.

> However, I don't feel like doing that right now. Possible
> future work :) For now, I'd just make the charset mount option.
> Okay ?

No. Please see above (1).

You have to distinguish the filesystem codeset from the user specified
codeset. NTFS doesn't need that mount option, although
other filesystems need it.

Since the above scenario will have a big impact on the pathname lookup
operation, I agree that it is better to choose a short-term solution
like below:
[1] add a codeset mount option to msdosfs and cdromfs, but do not add
  this to NTFS.
[2] add the user preferred codeset as a system-wide kernel option.

Again, do not add a codeset option to NTFS; it's not right.
The user preferred codeset should not be a mount option, and
it should be able to differ from the filesystem native codeset.
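
A rough sketch of the short-term variant in C, to show the separation
between [1] and [2]. The names (sketch_system_codeset, sketch_mount,
sketch_mount_init) and the filesystem type strings are assumptions made
up for this example, and rejecting the option with EINVAL is just one
possible way to express "NTFS has no codeset option".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical codeset identifiers (assumption, not a NetBSD API). */
enum codeset {
	CS_NONE = 0,	/* "no conversion" */
	CS_ISO8859_1,
	CS_EUC_JP,
	CS_SHIFT_JIS,
	CS_UCS2
};

/* [2] System-wide user preferred codeset (a kernel option). */
enum codeset sketch_system_codeset = CS_NONE;

/* [1] Per-mount filesystem native codeset. */
struct sketch_mount {
	const char *fstype;
	enum codeset fs_codeset;
};

/*
 * Initialize the mount's codeset from an optional mount option.
 * NTFS is always UCS-2, so a codeset option is rejected for it.
 * Returns 0 on success or EINVAL.
 */
int
sketch_mount_init(struct sketch_mount *mp, const char *fstype,
    int has_opt, enum codeset opt)
{
	assert(mp != NULL && fstype != NULL);
	mp->fstype = fstype;
	if (strcmp(fstype, "ntfs") == 0) {
		if (has_opt)
			return EINVAL;
		mp->fs_codeset = CS_UCS2;
		return 0;
	}
	/* Without the option, the filesystem is left as "no conversion". */
	mp->fs_codeset = has_opt ? opt : CS_NONE;
	return 0;
}
```

The user preferred codeset lives in sketch_system_codeset, not in
struct sketch_mount, so it can differ from the codeset of any mounted
filesystem.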

> > MS-DOS filesystem long filename extension uses both Unicode and
> > Shift_JIS in same filesystem, I don't have enough clue about Joliet
> > extension, but I suppose Joliet is same with msdosfs. (to achieve
> > compatibility to existing Shift_JIS CD-ROM).
> 
> You asian folks seem to have big fun with the filenames :-/

Yes, we have a lot of trouble with codesets. ;-)
--
soda