tech-kern: Re: codeconv v3 - kernel code set recoding engine

Subject: Re: codeconv v3 - kernel code set recoding engine
To: Noriyuki Soda <soda@sra.co.jp>
From: PER4MANCE, J. Dolecek <jdolecek@per4mance.cz>
List: tech-kern
Date: 03/07/2000 16:27:31

Noriyuki Soda wrote:
> No. It is better to use different codeconv_t.
> For example:
>         (1) VFAT vs SJIS userland.
>                 codeconv_t *k2u = codeconv_open("UTF-16LE", "SJIS");
>                 codeconv_t *u2k = codeconv_open("SJIS", "UTF-16LE");
>         (2) SJIS MS-DOS fs (not VFAT, but FAT) vs UTF-8 userland:
>                 codeconv_t *k2u = codeconv_open("SJIS", "UTF-8");
>                 codeconv_t *u2k = codeconv_open("UTF-8", "SJIS");

Did FAT really use SJIS? Or EUC-encoded names? I always thought that
FAT supports only a subset of ASCII - namely [A-Z0-9_-?] + one dot.
Oh god :(

>         (3) NFSv4 with UTF-8 vs SJIS userland
>                 codeconv_t *k2u = codeconv_open("UTF-8", "SJIS");
>                 codeconv_t *u2k = codeconv_open("SJIS", "UTF-8");
> I think there is no reason to use one codeconv_t for opposite
> direction conversion.

As I said, I thought it would be convenient. That's the only
reason I've done it this way for now :)

> No, it does cost.
> There are cases that only one direction conversion is needed.

But typically the caller would need conversion in both directions,
so why not provide what is commonly needed?

Furthermore, separate codeconv_enc() & codeconv_dec() (or whatever
they would be named) provide better type checking, FWIW.
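
To illustrate the type checking point, here is a rough sketch. The
kchar_t type, the struct layout and both signatures are made up for
this mail and are not the actual codeconv v3 interface; the conversion
itself is an ASCII-only stand-in. Because the two directions take
different pointer types, swapping the buffers by mistake is diagnosed
by the compiler, which a single function taking void * arguments and a
direction flag cannot do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: on-disk names are 16-bit units, userland names are bytes. */
typedef uint16_t kchar_t;

/* Hypothetical handle; a real one would carry conversion tables. */
struct codeconv {
	int dummy;
};

/* userland -> on-disk. Stand-in conversion: ASCII only, else '?'. */
size_t
codeconv_enc(const struct codeconv *cc, const char *src, size_t srclen,
    kchar_t *dst, size_t dstlen)
{
	size_t i;

	assert(cc != NULL && src != NULL && dst != NULL);
	for (i = 0; i < srclen && i < dstlen; i++)
		dst[i] = (unsigned char)src[i] < 0x80 ?
		    (kchar_t)src[i] : (kchar_t)'?';
	return i;	/* number of units written */
}

/* on-disk -> userland. Stand-in conversion: ASCII only, else '?'. */
size_t
codeconv_dec(const struct codeconv *cc, const kchar_t *src, size_t srclen,
    char *dst, size_t dstlen)
{
	size_t i;

	assert(cc != NULL && src != NULL && dst != NULL);
	for (i = 0; i < srclen && i < dstlen; i++)
		dst[i] = src[i] < 0x80 ? (char)src[i] : '?';
	return i;	/* number of bytes written */
}
```

Passing a char buffer as the source of codeconv_dec(), or a kchar_t
buffer as the source of codeconv_enc(), is an incompatible pointer type
and gets flagged at compile time.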

> IMHO, passing endiannes is wrong abstraction. Why passing endianess is
> needed although more general function like iconv(3) doesn't need that?

I imagine there might be other options which could be "configurable"
per-codeconv and usable for several code sets. But using a unique
code set name (like "Unicode-LE") is also ok.
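
A minimal sketch of the "unique code set name" variant, assuming a
simple name table: the byte order is implied by the name, so no
separate endianness argument is needed. The struct, the table and
codeset_lookup() are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads one 16-bit code unit from a byte buffer. */
typedef uint16_t (*get16_fn)(const unsigned char *);

static uint16_t
get16_le(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint16_t
get16_be(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical table entry: the name alone selects the byte order. */
struct codeset {
	const char *name;
	get16_fn get16;
};

static const struct codeset codesets[] = {
	{ "UTF-16LE", get16_le },
	{ "UTF-16BE", get16_be },
};

/* Returns the matching entry, or NULL for an unknown code set name. */
const struct codeset *
codeset_lookup(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(codesets) / sizeof(codesets[0]); i++)
		if (strcmp(codesets[i].name, name) == 0)
			return &codesets[i];
	return NULL;
}
```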

> It makes sense to use/share same function and implementation for NTFS
> and Joliet extension.
> But it doesn't make sense to implement it on codeconv layer.

To me, it makes good sense - codeconv has all the information it needs.
It knows both the "source" and "target" code set. It knows best how to
compare individual codes in a string.
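
Roughly what I have in mind, as a sketch only: a comparison routine
living next to the conversion code and working on 16-bit code units.
The function names are hypothetical, and the folding below is
deliberately simplified (ASCII plus the basic Cyrillic block
U+0410..U+042F); it is not full Unicode case folding.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified fold to lower case: ASCII A-Z and Cyrillic U+0410..U+042F. */
static uint16_t
fold16(uint16_t c)
{
	if (c >= 'A' && c <= 'Z')
		return (uint16_t)(c + ('a' - 'A'));
	if (c >= 0x0410 && c <= 0x042F)
		return (uint16_t)(c + 0x20);
	return c;
}

/*
 * Case-insensitive comparison of two strings of 16-bit code units.
 * Returns <0, 0 or >0, like strcmp(); a shorter string that is a
 * prefix of the longer one sorts first.
 */
int
codeconv_casecmp(const uint16_t *a, size_t alen,
    const uint16_t *b, size_t blen)
{
	size_t i, n = alen < blen ? alen : blen;

	for (i = 0; i < n; i++) {
		uint16_t ca = fold16(a[i]);
		uint16_t cb = fold16(b[i]);

		if (ca != cb)
			return ca < cb ? -1 : 1;
	}
	if (alen == blen)
		return 0;
	return alen < blen ? -1 : 1;
}
```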

> Case folded comparison is quite difficult than what you thought.
> For example, I've heard that there is a difference between MS-Windows
> 98 and MS-Windows NT about filename comparison. (e.g. handling of
> Cyrillic characters)

Well, we don't need to emulate case comparison as done by specific
operating systems - we can do it right :) The only case where code
depends on case-folded comparison is NTFS - file names in an NTFS
directory are indexed case-insensitively.

I wouldn't be surprised if MS did not do the case comparison correctly
under Win9X ;-/ But since MS Windows 95/98 support VFAT & cd9660/Joliet
only, we don't need to care, AFAICS.

> If you combine case-folded comparison feature with codeconv layer,
> you cannot use following codeconv_t:
>         codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE");
> rather, you have to use this:
>         codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE-Win95");
>                 for Windows 98
>         codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE-WinNT");
> 
> Do you really want to do this?

If Win95 Unicode and WinNT Unicode are really different, we need to do
this anyway, as you've noted in a followup mail.

Jaromir