Subject: Re: codeconv v3 - kernel code set recoding engine
To: None <dolecek@ics.muni.cz>
From: Noriyuki Soda <soda@sra.co.jp>
List: tech-kern
Date: 03/08/2000 01:04:50
> > For example:
> >         (1) VFAT vs SJIS userland.
> >                 codeconv_t *k2u = codeconv_open("UTF-16LE", "SJIS");
> >                 codeconv_t *u2k = codeconv_open("SJIS", "UTF-16LE");
> >         (2) SJIS MS-DOS fs (not VFAT, but FAT) vs UTF-8 userland:
> >                 codeconv_t *k2u = codeconv_open("SJIS", "UTF-8");
> >                 codeconv_t *u2k = codeconv_open("UTF-8", "SJIS");
> 
> Did FAT really use SJIS? Not an EUC encoding?

The FAT of Japanese MS-DOS doesn't support eucJP; it supports only SJIS.

> I always thought that FAT supports only a subset of ASCII - namely
> [A-Z0-9_-?] + one dot.

FAT (before the VFAT era) of Japanese MS-DOS supports only SJIS.
Another surprising thing is that SJIS FAT names can contain "\" (0x5c)
as a pathname character (as the second byte of a kanji).
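
To illustrate the 0x5c point: a byte-wise scan for "\" miscounts path
separators in an SJIS name, because 0x5c can be the trail byte of a
two-byte character (for example, U+8868 is 0x95 0x5c in Shift_JIS).
The following is only a sketch. The function names are hypothetical
and not part of codeconv, and the lead-byte ranges are the usual
0x81-0x9f and 0xe0-0xfc.

```c
#include <stddef.h>

/* Return nonzero if c is the first byte of a two-byte Shift_JIS character. */
static int
sjis_is_lead(unsigned char c)
{
	return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

/*
 * Count real '\' separators in an SJIS pathname, skipping any 0x5c
 * that is the second byte of a two-byte character.
 */
static size_t
sjis_count_separators(const unsigned char *s, size_t len)
{
	size_t i = 0, n = 0;

	while (i < len) {
		if (sjis_is_lead(s[i]) && i + 1 < len) {
			i += 2;		/* skip both bytes of the kanji */
			continue;
		}
		if (s[i] == 0x5c)
			n++;
		i++;
	}
	return n;
}

/* Naive byte-wise count, shown only for comparison. */
static size_t
naive_count_separators(const unsigned char *s, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (s[i] == 0x5c)
			n++;
	return n;
}
```

For the bytes 'A', 0x5c, 0x95, 0x5c, '.', 'T', 'X', 'T' the naive scan
sees two separators, while the SJIS-aware scan sees one.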

> >         (3) NFSv4 with UTF-8 vs SJIS userland
> >                 codeconv_t *k2u = codeconv_open("UTF-8", "SJIS");
> >                 codeconv_t *u2k = codeconv_open("SJIS", "UTF-8");
> > I think there is no reason to use one codeconv_t for conversion in
> > both directions.
> 
> As I said, I thought it would be convenient. That's the only
> reason I've done it this way for now :)

If that is the only reason, please don't do it that way.

> > No, it does cost.
> > There are cases where conversion in only one direction is needed.
> 
> But typically, caller would need conversion in both directions,
> so why not provide it with what is commonly needed ?

That assumption is wrong.
For example, Japanese console I/O often requires conversion in only one
direction (i.e. for output only), because the input side is covered by a
userland input method. (An input method typically has a process size
of > 1MB and a dictionary size of > 5MB.)

> Furthermore, separate codeconv_enc() & codeconv_dec() (or whatever
> they would be named) provide better type checking, FWIW.

No.
	u = codeconv_k2u(cc, k);
	k = codeconv_u2k(cc, u);
is no different from
	u = codeconv(k2u_cc, k);
	k = codeconv(u2k_cc, u);
as far as type checking is concerned.

> > IMHO, passing endianness is the wrong abstraction. Why is passing
> > endianness needed when a more general function like iconv(3)
> > doesn't need it?
> 
> I imagine there might be other options which might be "configurable"
> per-codeconv and usable for several code sets. But using a unique
> code set name (like "Unicode-LE") is also ok.

Mm, "Unicode-*" is a bad name, too. :-)
There are many Unicode encoding variants, e.g.
	UTF-7
	UTF-8
	UTF-16 little endian
	UTF-16 big endian
	UTF-16 with byte order mark
So, please don't use just "Unicode"; please use "UTF-16XX" or
something similar.
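
To show why the bare name "Unicode" is ambiguous, here is a small
sketch that serializes the same code point in three of the variants
listed above. The helper names are hypothetical, and the UTF-8 helper
assumes a BMP code point that is not a surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* Store one UTF-16 code unit in little-endian byte order. */
static void
put_utf16le(uint16_t u, unsigned char out[2])
{
	out[0] = (unsigned char)(u & 0xff);
	out[1] = (unsigned char)(u >> 8);
}

/* Store one UTF-16 code unit in big-endian byte order. */
static void
put_utf16be(uint16_t u, unsigned char out[2])
{
	out[0] = (unsigned char)(u >> 8);
	out[1] = (unsigned char)(u & 0xff);
}

/*
 * Encode a BMP code point as UTF-8 and return the number of bytes
 * written (1 to 3).  The caller must not pass a surrogate.
 */
static size_t
put_utf8_bmp(uint16_t u, unsigned char out[3])
{
	if (u < 0x80) {
		out[0] = (unsigned char)u;
		return 1;
	}
	if (u < 0x800) {
		out[0] = (unsigned char)(0xc0 | (u >> 6));
		out[1] = (unsigned char)(0x80 | (u & 0x3f));
		return 2;
	}
	out[0] = (unsigned char)(0xe0 | (u >> 12));
	out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3f));
	out[2] = (unsigned char)(0x80 | (u & 0x3f));
	return 3;
}
```

For U+3042 this gives 0x42 0x30 (UTF-16 little endian), 0x30 0x42
(UTF-16 big endian) and 0xe3 0x81 0x82 (UTF-8), so a code set name has
to say which of these byte sequences is meant.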

> > It makes sense to use/share same function and implementation for NTFS
> > and Joliet extension.
> > But it doesn't make sense to implement it on codeconv layer.
> 
> To me, it makes good sense - codeconv has all information it needs.
> It knows both the "source" and "target" code set. It knows best how to
> compare individual codes in a string.

Hmm, I'll try to think about a better way to define name comparison
functions.  Could you wait for a while?

> > Case-folded comparison is much more difficult than you think.
> > For example, I've heard that there is a difference between MS-Windows
> > 98 and MS-Windows NT in filename comparison (e.g. handling of
> > Cyrillic characters).
> 
> Well, we don't need to emulate case comparison as done by specific
> operating systems - we can do it right :) The only case where code
> depends on case folded comparison is in NTFS - file names in an NTFS
> directory are indexed case-insensitively.

No. (at least for case conversion functions)
If we don't do it the same way as the original OS, we might create a
filename that cannot be accessed from the original OS. :-<
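
A small sketch of the problem, using two toy folding rules. These are
hypothetical and are NOT the actual tables of any MS-Windows version;
they only show that the result of a name comparison depends on which
rule is used.

```c
#include <stddef.h>
#include <stdint.h>

/* A folding rule maps one UTF-16 code unit to its upper-case form. */
typedef uint16_t (*upcase_fn)(uint16_t);

/* Toy rule A: fold ASCII a-z only. */
static uint16_t
upcase_ascii_only(uint16_t c)
{
	return (c >= 'a' && c <= 'z') ? (uint16_t)(c - 0x20) : c;
}

/* Toy rule B: fold ASCII plus basic Cyrillic U+0430..U+044F. */
static uint16_t
upcase_ascii_cyrillic(uint16_t c)
{
	if (c >= 0x0430 && c <= 0x044f)
		return (uint16_t)(c - 0x20);
	return upcase_ascii_only(c);
}

/* Compare two UTF-16 names of equal length n under a folding rule. */
static int
names_equal(const uint16_t *a, const uint16_t *b, size_t n, upcase_fn up)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (up(a[i]) != up(b[i]))
			return 0;
	return 1;
}
```

With these toy rules, the one-character names U+0436 and U+0416 are
the same name under rule B but two different names under rule A. If
our rule differs from the one the original OS uses, the two systems
disagree about which directory entry a name refers to.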

> > you cannot use following codeconv_t:
> >         codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE");
> > rather, you have to use this:
> >         codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE-Win95");
> >                 for Windows 98
> >         codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE-WinNT");
> > 
> > Do you really want to do this?
> 
> If Win95 Unicode and WinNT Unicode are really different, we need to do
> this anyway, as you've noted in a followup mail.

Yup.
But that doesn't mean the code conversion layer should support case folding.
--
soda