tech-kern: codeconv v3 - kernel code set recoding engine

Subject: codeconv v3 - kernel code set recoding engine
To: None <tech-kern@NetBSD.org>
From: Jaromir Dolecek <dolecek@ics.muni.cz>
List: tech-kern
Date: 03/02/2000 20:03:07
Hi,
this is the next iteration of codeconv development.

What is it ?
-=-=-=-=-=-=-
An engine for recoding text from one code set to another.

Why is something like this necessary ?
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Well, several filesystems (MSDOSFS, Joliet CD9660, NTFS) natively use
Unicode for filenames [*], and the kernel has to translate each filename
to something sane to keep all the code and utilities that expect a
classic Unix 8-bit filename happy.

The filesystem code now "supports" Unicode by masking off the higher
byte and passing just the lower byte, effectively supporting only
the iso-8859-1 code set. This is unfortunate for everyone not
served by that code set, e.g. Central Europeans, people
using Cyrillic, or Japanese folks.
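
To make the loss concrete, here is a minimal userland sketch of the
masking described above. The function name mask_high_byte() is
hypothetical and is not taken from the NetBSD sources; the sketch only
assumes that names are stored as 16-bit little-endian units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration (not the actual filesystem code):
 * reduce a little-endian 16-bit Unicode name to 8-bit characters by
 * keeping only the low byte of every unit.
 * Returns the number of bytes written to dst.
 */
size_t
mask_high_byte(const uint8_t *src, size_t srclen, char *dst, size_t dstlen)
{
	size_t i, n = 0;

	assert(src != NULL && dst != NULL);
	for (i = 0; i + 1 < srclen && n < dstlen; i += 2) {
		/* src[i] is the low byte; src[i + 1], the high byte, is dropped. */
		dst[n++] = (char)src[i];
	}
	return n;
}
```

With this, U+00E9 (bytes 0xE9 0x00) survives as the iso-8859-1 byte 0xE9,
but U+010D (bytes 0x0D 0x01, a Central European letter) degrades to 0x0D,
which is a carriage return.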

Why it has to be in kernel ?
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Well, passing the name as is and letting userland deal with it
is not very wise:
1) the kernel itself needs some sane representation of the filename,
	e.g. for the name cache
2) while the kernel knows which part of a filename is in which
   code set, userland does not; passing that information along
   would be quite expensive. Note that recoding ALL filenames in the
   kernel to one universal alphabet and passing all filenames to
   userland in this alphabet (where they'd be recoded to the
   user-preferred code set) is neither practical nor doable (e.g. FFS
   has no defined code set for filenames). Besides that, according to
   some Japanese folks, the universal alphabet concept just doesn't work
   and using Unicode for that purpose has some serious problems for them.
Where did it come from ?
-=-=-=-=-=-=-=-=-=-=-=-=-
Motomichi Matsuzaki <mzaki@e-mail.ne.jp> wrote a set of patches
for FreeBSD's cd9660 code to support several code sets, namely
iso-8859-2, koi8-r, eucjp (and of course iso-8859-1), plus the ability to
recode the names to utf-8. I took the code, added a bunch of features,
cleaned it up a lot and generalized it to be usable as a universal
code set converter.

It still uses Unicode internally, though the API is not bound to it.

Such an engine would probably be usable not only for filesystems,
but also for wscons & USB.

Where is the code available for review ?
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
The code is available at
	http://www.ics.muni.cz/~dolecek/NetBSD/codeconv-v3.tar.gz
or untarred for browsing at
	http://www.ics.muni.cz/~dolecek/NetBSD/codeconv-v3/

Note that the code above is the plain codeconv engine plus a small
regression program; the fs-specific diffs needed to use codeconv
are not included.

Codeconv API
-=-=-=-=-=-=-
The engine currently supports recoding from Unicode
to the above-mentioned code sets, as well as recoding the other way
around. It also exposes functions to compare a filesystem-"native"
string with a userland-native string.

Here are the details of the exposed API. The API has changed considerably
since the last version.

There are 6 externally visible functions at the moment.

Getting and closing codeconv handle
-----------------------------------

codeconv_t *codeconv_open (const char *kernelcodeset, const char *usercodeset);
int codeconv_close (codeconv_t *);

The usage should be obvious. The pointer passed to codeconv_close()
has to be the one returned by codeconv_open().

kernelcodeset is the name of the code set used in the kernel, i.e. the
filesystem "native" code set. usercodeset is the code set in which the
filenames should be presented to userland. Note that at the moment the
only kernelcodeset supported is Unicode.

Recode functions
----------------
int codeconv_k2u (codeconv_t *cc, char *dst, size_t dstlen,
	const void *src, size_t srclen, size_t *writelen,
	size_t *readlen);
int codeconv_u2k (codeconv_t *cc, void *dst, size_t dstlen,
	const char *src, size_t srclen, size_t *writelen,
	size_t *readlen);

codeconv_k2u() recodes a string from the filesystem-native representation
to the userland representation. This recoding might also mean
the codes get encoded, such as to EUC or UTF-8 form.
Both srclen and dstlen are in bytes,
i.e. if the Unicode string has two "characters", srclen
would be 4. If writelen is non-NULL, it's set to the
number of bytes written to the "destination" string.
If readlen is non-NULL, it's set to the total number
of bytes read from the source string.

codeconv_u2k() recodes a string from the userland representation
to the filesystem-native representation. Otherwise its semantics
are the same as for codeconv_k2u(), just in the opposite direction.
The codeconv_t pointer passed to both functions is the
one returned from codeconv_open().

Note that the kernel representation is assumed to be
in little endian format. As MSDOSFS, NTFS and
Joliet all use it, this assumption seems to be ok to make.

Both codeconv_k2u() and codeconv_u2k() return either 0
if no error was encountered, or E2BIG if the destination
string is not long enough to hold the result of recoding.
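
The length and error conventions above can be illustrated with a small
userland stand-in. This is only a sketch under stated assumptions, not
the codeconv implementation: sketch_k2u_utf8() is a hypothetical name,
it takes no codeconv_t handle, the user code set is fixed to UTF-8, and
surrogate pairs are ignored. What the counters contain after E2BIG is
also my assumption, since the description above does not say.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for codeconv_k2u(): recode 16-bit little-endian
 * units to UTF-8. Returns 0 on success or E2BIG if dst is too small.
 * On E2BIG the counters describe the part converted so far (assumption).
 */
int
sketch_k2u_utf8(char *dst, size_t dstlen, const void *src, size_t srclen,
    size_t *writelen, size_t *readlen)
{
	const uint8_t *s = src;
	size_t in = 0, out = 0;
	int error = 0;

	assert(src != NULL || srclen == 0);
	while (in + 1 < srclen) {
		/* Assemble one unit: low byte first, then high byte. */
		uint16_t c = (uint16_t)(s[in] | (s[in + 1] << 8));
		size_t need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;

		if (dstlen - out < need) {
			error = E2BIG;
			break;
		}
		if (need == 1) {
			dst[out++] = (char)c;
		} else if (need == 2) {
			dst[out++] = (char)(0xC0 | (c >> 6));
			dst[out++] = (char)(0x80 | (c & 0x3F));
		} else {
			dst[out++] = (char)(0xE0 | (c >> 12));
			dst[out++] = (char)(0x80 | ((c >> 6) & 0x3F));
			dst[out++] = (char)(0x80 | (c & 0x3F));
		}
		in += 2;
	}
	if (writelen != NULL)
		*writelen = out;
	if (readlen != NULL)
		*readlen = in;
	return error;
}
```

For the two-character source 0x61 0x00 0x0D 0x01 (srclen 4), an 8-byte
destination receives 0x61 0xC4 0x8D with writelen 3 and readlen 4. A
2-byte destination yields E2BIG after the first character, with writelen
1 and readlen 2.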

Compare functions
-----------------
int codeconv_cmp (codeconv_t *cc, const void *kstr, size_t kstrlen,
	const char *ustr, size_t ustrlen);
int codeconv_icasecmp (codeconv_t *cc, const void *kstr, size_t kstrlen,
	const char *ustr, size_t ustrlen);

kstr is a "string" in the kernel/filesystem native code set,
ustr is a string in the userland representation. cc is assumed
to be the handle used for recoding from the appropriate kernel
code set to the userland code set.

The functions behave similarly to memcmp() or strcasecmp(),
i.e. they return an integer greater than, equal to, or
less than 0, according to whether the string kstr is greater than,
equal to, or less than the string ustr (after converting it to the
kernel code set).
codeconv_icasecmp() compares the strings case-insensitively;
the whole of Unicode is supported (see codeset-unicode.c:unicode_toupper()
for the implementation).
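
The comparison semantics can be sketched in userland as follows. This is
a hedged illustration, not the engine's code: sketch_cmp_latin1() and
fold_ascii() are hypothetical names, the user code set is fixed to
iso-8859-1 (where each byte equals its Unicode code point), there is no
codeconv_t handle, and case folding covers ASCII letters only, unlike
the full Unicode folding mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold ASCII letters only; a simplification made for this sketch. */
static uint16_t
fold_ascii(uint16_t c)
{
	return (c >= 'a' && c <= 'z') ? (uint16_t)(c - 'a' + 'A') : c;
}

/*
 * Hypothetical stand-in for codeconv_cmp() (icase == 0) and
 * codeconv_icasecmp() (icase != 0). kstr holds 16-bit little-endian
 * units; both lengths are in bytes.
 */
int
sketch_cmp_latin1(const void *kstr, size_t kstrlen, const char *ustr,
    size_t ustrlen, int icase)
{
	const uint8_t *k = kstr;
	const uint8_t *u = (const uint8_t *)ustr;
	size_t kchars = kstrlen / 2, i;

	assert(kstr != NULL && ustr != NULL);
	for (i = 0; i < kchars && i < ustrlen; i++) {
		uint16_t kc = (uint16_t)(k[2 * i] | (k[2 * i + 1] << 8));
		uint16_t uc = u[i];	/* user byte "converted" to kernel code set */

		if (icase) {
			kc = fold_ascii(kc);
			uc = fold_ascii(uc);
		}
		if (kc != uc)
			return (kc > uc) ? 1 : -1;
	}
	/* The common prefix is equal: the longer string sorts last. */
	if (kchars != ustrlen)
		return (kchars > ustrlen) ? 1 : -1;
	return 0;
}
```

With kstr holding "Read" as four 16-bit units, comparing against the
user string "READ" gives a positive result in the case-sensitive mode
and 0 in the case-insensitive mode.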

Desired usage
-------------
The filesystems are supposed to use the engine on a
per-filesystem basis. Most likely the code set
would be one of the parameters to the respective mount_FOO(1) command
and be passed as a string in the filesystem-specific mount structure;
the filesystem would then take care of getting the codeconv handle and
recoding the filenames to the user-preferred encoding.
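
As a rough sketch of how the code set name might travel in a mount
structure: everything below is hypothetical. "foofs", struct foofs_args,
FOOFS_CODESET_MAX and foofs_args_set_codeset() are invented for
illustration and do not exist in any tree; the fixed-size buffer and the
error codes are assumptions too.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define FOOFS_CODESET_MAX 32	/* hypothetical limit on the name length */

/* Hypothetical mount arguments for an imaginary "foofs" filesystem. */
struct foofs_args {
	const char *fspec;			/* device to mount */
	char codeset[FOOFS_CODESET_MAX];	/* user code set, e.g. "koi8-r" */
};

/*
 * What a mount_foofs command might do with its code set option:
 * validate the name and copy it into the mount structure. The kernel
 * side would later pass args->codeset to codeconv_open() as usercodeset.
 */
int
foofs_args_set_codeset(struct foofs_args *args, const char *name)
{
	size_t len;

	assert(args != NULL && name != NULL);
	len = strlen(name);
	if (len == 0)
		return EINVAL;
	if (len >= sizeof(args->codeset))
		return ENAMETOOLONG;
	memcpy(args->codeset, name, len + 1);	/* copy the terminating NUL too */
	return 0;
}
```

Keeping the name as a plain string in the mount structure means the
mount command needs no knowledge of which code sets are compiled in;
only the open call in the kernel has to decide that.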

Things to be done
-----------------
* make the support for each code set LKMable - right now
  it's conditionally compiled in on presence of a CODECONV_FOO define,
  which is unfortunate - would probably need some backward compatible
  additions to the LKM interface, but nothing very hard
* make it possible to compile some "smaller" version (when CODECONV_SMALL
  is defined ?) which would provide just the most commonly used
  pieces, primarily for use by INSTALL kernels; INSTALL kernels
  should probably not use codeconv at all, if space is an issue
* make recoding between two arbitrary code sets possible
  (not currently needed, though easy - the code supports recoding from/to
   Unicode for all the code sets, so implementing recoding between two
   arbitrary ones would be a matter of writing some glue code)
* add functions for comparing userland/userland and kernel/kernel
  strings besides the userland/kernel and kernel/userland ones ?
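
The "glue code" idea from the third item, recoding between two code sets
by going through Unicode, could look roughly like this. It is a sketch
under assumptions, not part of codeconv: all names are hypothetical,
only single-byte code sets are handled, the iso-8859-2 table is
deliberately cut down to ASCII plus one character, and replacing
unmappable characters with '?' is my choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-code-set hooks; both directions go through Unicode. */
struct sketch_codeset {
	uint16_t (*to_unicode)(uint8_t c);
	int (*from_unicode)(uint16_t u);	/* byte value, or -1 if unmappable */
};

/* iso-8859-1: every byte equals its Unicode code point. */
static uint16_t latin1_to_unicode(uint8_t c) { return c; }
static int latin1_from_unicode(uint16_t u) { return (u <= 0xFF) ? (int)u : -1; }

/* Deliberately partial iso-8859-2: ASCII plus 0xE8 (U+010D) only. */
static uint16_t
latin2_to_unicode(uint8_t c)
{
	if (c < 0x80)
		return c;
	return (c == 0xE8) ? 0x010D : 0xFFFD;	/* U+FFFD: replacement char */
}

static int
latin2_from_unicode(uint16_t u)
{
	if (u < 0x80)
		return (int)u;
	return (u == 0x010D) ? 0xE8 : -1;
}

const struct sketch_codeset sketch_latin1 =
    { latin1_to_unicode, latin1_from_unicode };
const struct sketch_codeset sketch_latin2 =
    { latin2_to_unicode, latin2_from_unicode };

/*
 * Glue: recode len bytes from one single-byte code set to another,
 * using Unicode as the pivot. dst must have room for len bytes.
 */
void
sketch_recode(const struct sketch_codeset *from,
    const struct sketch_codeset *to, const uint8_t *src, uint8_t *dst,
    size_t len)
{
	size_t i;

	assert(from != NULL && to != NULL);
	for (i = 0; i < len; i++) {
		int c = to->from_unicode(from->to_unicode(src[i]));

		dst[i] = (c < 0) ? (uint8_t)'?' : (uint8_t)c;
	}
}
```

Recoding the iso-8859-2 bytes 'a', 0xE8 to iso-8859-1 gives 'a', '?',
because U+010D has no iso-8859-1 equivalent; recoding them back to
iso-8859-2 through the same pivot returns them unchanged.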

I'd appreciate any feedback. I'd like to get this into the tree
(as well as the fs-specific bits) before the 1.5 cut, if possible
and appropriate.

The API might be enhanced or change a little in the final version,
if I encounter problems with using codeconv in the filesystem
code in its current form. No major changes are planned, though.

Jaromir

[*] MSDOSFS does use Unicode for long filenames AFAIK
-- 
Jaromir Dolecek <jdolecek@NetBSD.org>      http://www.ics.muni.cz/~dolecek/
@@@@  Wanna a real operating system ? Go and get NetBSD, damn!  @@@@