tech-kern: Re: code set recoding engine, V2

Subject: Re: code set recoding engine, V2
To: Noriyuki Soda <soda@sra.co.jp>
From: Jaromir Dolecek <dolecek@ics.muni.cz>
List: tech-kern
Date: 12/13/1999 11:27:38
Noriyuki Soda wrote:
> Mmm, it seems that the above `codeset' doesn't really mean codeset,
> but means `code converter'.

Yeah, that's true. The codeset_* should be renamed to codeconv_*
or kcodeconv_* - I like the last one, personally.

> And the above interface has a problem which the iconv(3) interface doesn't
> have. For example, if an interface is intended to be used for partial
> conversion, it should take a `conversion state' argument, since
> 	- One character in one codeset might be multiple characters
> 	  in another codeset, and vice versa, because the model of what
> 	  a character means differs between codesets.
> 	- Stateful encoding requires conversion state by its nature.

Currently, yes. But handling partial conversion/conversion state
is a matter of changing codeset_t (or [k]codeconv_t), changing
the internal implementation and recompiling the kernel, i.e.
the outside interface doesn't need to be changed. So I'd
postpone solving this until it is really needed :)
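
To illustrate why the outside interface could stay as it is, here is a
minimal sketch, assuming callers only ever hold a pointer to an opaque
type. The function names, signatures and struct fields below are
hypothetical and made up for illustration; they are not the proposed API.

```c
#include <assert.h>
#include <stddef.h>

/* ---- what callers would see (hypothetical) ---- */
typedef struct kcodeconv kcodeconv_t;   /* opaque to callers */
kcodeconv_t *kcodeconv_toy(void);
int kcodeconv_convc(kcodeconv_t *cc, unsigned char in, unsigned char *out);

/* ---- private implementation, free to change later ---- */
struct kcodeconv {
	unsigned char (*map)(unsigned char);  /* per-byte mapping */
	unsigned int state;  /* room for conversion state; unused so far */
};

/* Toy mapping (assumption): 0xE1 becomes plain 'a', the rest is kept. */
static unsigned char
toy_map(unsigned char c)
{
	return (c == 0xE1) ? 'a' : c;
}

static kcodeconv_t toy_conv = { toy_map, 0 };

kcodeconv_t *
kcodeconv_toy(void)
{
	return &toy_conv;
}

/* Convert one input byte; returns the number of output bytes written. */
int
kcodeconv_convc(kcodeconv_t *cc, unsigned char in, unsigned char *out)
{
	assert(cc != NULL && out != NULL);
	*out = cc->map(in);
	return 1;
}
```

Since callers never look inside struct kcodeconv, giving the state field
a real meaning later would only need a recompile, not an API change.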

The issue would get a bit more complicated once it becomes
possible to load codeconv LKMs, but thinking about that
is a bit premature and not an issue right now :)

> Thus, an interface just like standard iconv(3) would be needed.
> 
> Or, perhaps, an interface like the one below might be a candidate, because:
> 	- iconv_open(3) might be slow, if per-process pathname-codeset 
> 	  is supported.
> 	  (the following kcodeset_t intends to avoid this problem)
> 	- and iconv(3) might be slow too, since it requires dynamic
> 	  memory allocation.
> 	  (the following interface uses fixed size type for conversion
> 	   state, although this certainly limits implementation.)

[some stuff snipped]

Well, currently no codeset conversion is supported for character sets
that need partial conversion, so I wouldn't bother trying to come
up with an all-singing, all-dancing solution before an actual need arises.
The internal implementation can be enhanced later as needed.

Anyway, I think that a single set of [k]codeconv_*() routines would be
enough and they should do all the necessary stuff. I'd rather not blur
the interface by having extra kcodeset_open() & kcodeconv_open() and
all that stuff.

> Another problem of the proposed interface is that codeset_readc()
> is not really a conversion function between various codesets, but
> a conversion function from multibyte representation to wide character
> representation in the same codeset (like mbrtowc(3)).

Yeah.

> The length argument definitely should be byte count (on multibyte
> interface).

Okay.
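
For reference, this is roughly how the userland mbrtowc(3) pattern looks,
with an explicit conversion state and a length given in bytes. It is
only a sketch of the standard C interface, not kernel code; decode_all()
is a made-up helper name.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Decode at most nbytes bytes of s into out[] (room for outmax entries).
 * Returns the number of wide characters stored, or (size_t)-1 if the
 * input is invalid or ends in the middle of a character.
 */
size_t
decode_all(const char *s, size_t nbytes, wchar_t *out, size_t outmax)
{
	mbstate_t st;
	size_t n = 0;

	memset(&st, 0, sizeof(st));          /* initial conversion state */
	while (nbytes > 0 && n < outmax) {
		size_t r = mbrtowc(&out[n], s, nbytes, &st);

		if (r == (size_t)-1 || r == (size_t)-2)
			return (size_t)-1;   /* invalid or incomplete input */
		if (r == 0)
			break;               /* hit the terminating NUL */
		s += r;                      /* r is a byte count */
		nbytes -= r;
		n++;
	}
	return n;
}
```

The mbstate_t argument is what carries the state between calls, so the
same loop also works when the input arrives in pieces.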

> P.S.
> I'll think about an interface which can be used as a substitute for
> codeset_convc()/codeset_readc(). Are these routines intended to
> be used for case fold comparison in pathname lookup?

Yes, they are. For case-sensitive comparison of an unconverted string
with an already converted one, a separate function might be exported by
the API (say kcodeconv_strcmp()). But some filesystems might need to
do case-insensitive lookup of Unicode pathnames, and I feel uneasy
about putting a whole Unicode uppercase mapping table into the kernel
(lots of Unicode characters don't have uppercase forms, so the map
wouldn't necessarily be 128KB big, but anyway). E.g. NTFS really needs
a full uppercase mapping table, as the filenames are indexed in
directories case-insensitively. I'll try to find out how big the
uppercase mapping table would need to be; maybe it would not be quite
that bad. Stay tuned.
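
To make the size argument concrete: a flat table indexed by 16-bit
character code needs 65536 entries * 2 bytes = 131072 bytes = 128KB.
A rough sketch of a smaller alternative, assuming 16-bit code units and
a sorted table of ranges searched by binary search, follows. The table
holds only a tiny illustrative subset of the real mappings, and the
names ucs2_toupper() and ucs2_casecmp() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Characters first..last map to uppercase by adding delta. */
struct uprange {
	uint16_t first, last;
	int16_t delta;
};

/* Must stay sorted by 'first'; illustrative subset only. */
static const struct uprange uptab[] = {
	{ 0x0061, 0x007A, -32 },   /* Latin a-z */
	{ 0x00E0, 0x00F6, -32 },   /* Latin-1 a-grave .. o-diaeresis */
	{ 0x00F8, 0x00FE, -32 },   /* Latin-1 o-slash .. thorn */
	{ 0x0430, 0x044F, -32 },   /* Cyrillic small a .. ya */
};

/* Binary search; characters not in any range map to themselves. */
uint16_t
ucs2_toupper(uint16_t c)
{
	size_t lo = 0, hi = sizeof(uptab) / sizeof(uptab[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (c < uptab[mid].first)
			hi = mid;
		else if (c > uptab[mid].last)
			lo = mid + 1;
		else
			return (uint16_t)(c + uptab[mid].delta);
	}
	return c;
}

/* Case-insensitive comparison of two n-unit strings. */
int
ucs2_casecmp(const uint16_t *a, const uint16_t *b, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint16_t x = ucs2_toupper(a[i]), y = ucs2_toupper(b[i]);

		if (x != y)
			return (x < y) ? -1 : 1;
	}
	return 0;
}
```

How small such a table could really get depends on how many ranges the
full mapping breaks into, which is exactly the number still to be
found out.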

Jaromir
-- 
Jaromir Dolecek <jdolecek@NetBSD.org>      http://www.ics.muni.cz/~dolecek/
@@@@  Wanna a real operating system ? Go and get NetBSD, damn!  @@@@