Subject: Re: code set recoding engine, V2
To: None <dolecek@ics.muni.cz, thorpej@nas.nasa.gov>
From: Noriyuki Soda <soda@sra.co.jp>
List: tech-kern
Date: 11/23/1999 02:10:19
> after the recent discussion and received feedback, I've reworked
> the API and removed the unicodisms.  The API is now somewhat
> similar to that of iconv(), on Noriyuki Soda's proposal sent in private mail.

Hi, sorry for the late reply (I have been away from home since Nov 11),
and sorry that I was not clear about what "similar to iconv()" means.

> I used "codeset_" prefix though, as the API is not 100% the same. 

> The proposed API to code set recoding engine follows. Note that
> codeset_t is used instead of former "const struct codeset_table",
> to enhance readability and encapsulation.
> 
> codeset_t *codeset_open __P((const char *tocode, const char *fromcode));
> int codeset_close __P((codeset_t *codeset));
> 
> ssize_t codeset_conv __P((codeset_t *codeset, char *dst, size_t dstlen,
> 				const void *src, size_t srclen));
> u_int32_t codeset_convc __P((codeset_t *codeset, u_int32_t code));
> u_int32_t codeset_readc __P((codeset_t *codeset, const char *src,
> 				char const **result));

Mmm, it seems that the above `codeset' doesn't really mean codeset,
but means `code converter'.

And the above interface has a problem which the iconv(3) interface
doesn't have. For example, if an interface is intended to be used for
partial conversion, it should take a `conversion state' argument, since
	- One character in one codeset might be multiple characters
	  in another codeset, and vice versa, because the model of what
	  a character means differs between codesets.
	- A stateful encoding requires conversion state by its nature.

Thus, an interface just like the standard iconv(3) is needed.
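
To illustrate the point about partial conversion, here is a minimal
userland sketch using the standard iconv(3) calls. The helper name
convert_in_chunks() and its chunking policy are only assumptions made
for this example; they are not part of any proposal. The input is fed
a few bytes at a time, and the bytes of a character that is split across
a chunk boundary are simply offered again together with the next chunk,
while the descriptor keeps any shift state between calls.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: convert src in pieces of at most `chunk' new bytes.
 * Returns the number of bytes written to dst, or -1 on any error.
 */
ssize_t
convert_in_chunks(const char *tocode, const char *fromcode,
    const char *src, size_t srclen, size_t chunk,
    char *dst, size_t dstlen)
{
	iconv_t cd;
	char *in = (char *)src;		/* iconv(3) takes a non-const char ** */
	char *out = dst;
	size_t remaining = srclen, outleft = dstlen, carry = 0;

	assert(chunk > 0);
	cd = iconv_open(tocode, fromcode);
	if (cd == (iconv_t)-1)
		return -1;

	while (remaining > 0) {
		/* Offer the unconsumed tail of the last call plus new bytes. */
		size_t feed = chunk + carry;
		size_t inleft, r;
		int last;

		if (feed > remaining)
			feed = remaining;
		last = (feed == remaining);
		inleft = feed;

		r = iconv(cd, &in, &inleft, &out, &outleft);
		remaining -= feed - inleft;

		if (r == (size_t)-1) {
			/*
			 * EINVAL means the chunk ended inside a character.
			 * That is fine unless no more input exists.
			 */
			if (errno != EINVAL || last) {
				iconv_close(cd);
				return -1;
			}
			carry = inleft;
		} else {
			carry = 0;
		}
	}

	/* Emit any reset sequence a stateful encoding may need. */
	if (iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1) {
		iconv_close(cd);
		return -1;
	}
	iconv_close(cd);
	return (ssize_t)(dstlen - outleft);
}
```

With a chunk size of 2, converting the four bytes "a\xe3\x81\x82" from
UTF-8 to UTF-16LE needs two iconv() calls: the first one stops in the
middle of the second character, and the second call completes it.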

Or, perhaps, an interface like the one below might be a candidate, because:
	- iconv_open(3) might be slow, if a per-process pathname codeset
	  is supported.
	  (the following kcodeset_t is intended to avoid this problem)
	- and iconv(3) might be slow too, since it requires dynamic
	  memory allocation.
	  (the following interface uses a fixed-size type for the conversion
	   state, although this certainly limits the implementation.)
/*
 * Codeset handle for the kernel-internal interface.
 * The API between kernel and userland should pass the codeset name
 * as a string.  This value is allocated dynamically (i.e. not #define'd).
 */
typedef int kcodeset_t; 

/*
 * Fixed size placeholder for codeconverter LKMs.
 * Actual implementation differs between codeconverters.
 */
typedef struct {
	char placeholder[32];
} kcodeconv_t;

int kcodeset_open __P((kcodeset_t *codeset,
   const char *codeset_name));
int kcodeset_close __P((kcodeset_t codeset));

int kcodeconv_open __P((kcodeconv_t *converter,
    kcodeset_t dstcode, kcodeset_t srccode));
int kcodeconv __P((kcodeconv_t *converter,
    char **dst, size_t *dstbytes_left,
    const char **src, size_t *srcbytes_left,
    size_t *non_identical_conversions));
int kcodeconv_close __P((kcodeconv_t *converter));
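
As a rough sketch of how a converter might use the fixed-size
placeholder, the fragment below keeps a private state structure inside
kcodeconv_t. Everything prefixed with demo_ (the state layout, the
helper names) is hypothetical and exists only for this illustration;
the typedefs are repeated so that the fragment compiles on its own.
The state is copied in and out with memcpy() because the placeholder is
a char array and carries no alignment guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int kcodeset_t;
typedef struct {
	char placeholder[32];
} kcodeconv_t;

/* Hypothetical private state of one stateful converter. */
struct demo_conv_state {
	kcodeset_t dstcode;
	kcodeset_t srccode;
	int shift_state;		/* current shift/escape state */
	unsigned char pending[4];	/* bytes of an incomplete character */
	unsigned char npending;
};

/* The limitation mentioned above: the state must fit the placeholder. */
_Static_assert(sizeof(struct demo_conv_state) <= sizeof(kcodeconv_t),
    "converter state does not fit in kcodeconv_t");

static void
demo_state_store(kcodeconv_t *cv, const struct demo_conv_state *st)
{
	memset(cv, 0, sizeof(*cv));
	memcpy(cv->placeholder, st, sizeof(*st));
}

static void
demo_state_load(const kcodeconv_t *cv, struct demo_conv_state *st)
{
	memcpy(st, cv->placeholder, sizeof(*st));
}

/* Hypothetical open routine: no dynamic allocation is needed. */
int
demo_kcodeconv_open(kcodeconv_t *cv, kcodeset_t dstcode, kcodeset_t srccode)
{
	struct demo_conv_state st;

	assert(cv != NULL);
	memset(&st, 0, sizeof(st));
	st.dstcode = dstcode;
	st.srccode = srccode;
	demo_state_store(cv, &st);
	return 0;
}
```

The caller can keep the kcodeconv_t on its stack, so opening a converter
costs no more than initializing 32 bytes.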

For non-partial (one-shot) conversion, the following interface might be
used:
	int kcodeconv_string __P((
	    kcodeset_t dstcode, char *dst, size_t *dst_bytes,
	    kcodeset_t srccode, const char *src, size_t src_bytes,
	    size_t *non_identical_conversions));
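
One possible way to layer the one-shot call on top of the partial
interface is sketched below. Several things here are assumptions made
only for the example: the toy ISO 8859-1 to UTF-8 converter, the DEMO_
codeset numbers, the errno-style return values, and the convention that
*dst_bytes holds the buffer size on entry and the number of bytes
written on successful return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int kcodeset_t;
typedef struct {
	char placeholder[32];
} kcodeconv_t;

enum { DEMO_LATIN1 = 1, DEMO_UTF8 = 2 };	/* hypothetical codeset ids */

/* Toy converter: only ISO 8859-1 to UTF-8, which needs no state. */
static int
kcodeconv_open(kcodeconv_t *cv, kcodeset_t dstcode, kcodeset_t srccode)
{
	if (dstcode != DEMO_UTF8 || srccode != DEMO_LATIN1)
		return EINVAL;
	cv->placeholder[0] = 0;
	return 0;
}

static int
kcodeconv(kcodeconv_t *cv, char **dst, size_t *dstbytes_left,
    const char **src, size_t *srcbytes_left,
    size_t *non_identical_conversions)
{
	(void)cv;
	(void)non_identical_conversions;	/* this mapping is exact */
	while (*srcbytes_left > 0) {
		unsigned char c = (unsigned char)**src;
		size_t need = (c < 0x80) ? 1 : 2;

		if (*dstbytes_left < need)
			return E2BIG;	/* pointers show how far we got */
		if (need == 1) {
			*(*dst)++ = (char)c;
		} else {
			*(*dst)++ = (char)(0xC0 | (c >> 6));
			*(*dst)++ = (char)(0x80 | (c & 0x3F));
		}
		*dstbytes_left -= need;
		(*src)++;
		(*srcbytes_left)--;
	}
	return 0;
}

static int
kcodeconv_close(kcodeconv_t *cv)
{
	(void)cv;
	return 0;
}

/* One-shot wrapper: open, convert the whole string, close. */
int
kcodeconv_string(kcodeset_t dstcode, char *dst, size_t *dst_bytes,
    kcodeset_t srccode, const char *src, size_t src_bytes,
    size_t *non_identical_conversions)
{
	kcodeconv_t cv;
	size_t dstleft;
	int error;

	assert(dst_bytes != NULL && non_identical_conversions != NULL);
	dstleft = *dst_bytes;
	*non_identical_conversions = 0;
	error = kcodeconv_open(&cv, dstcode, srccode);
	if (error != 0)
		return error;
	error = kcodeconv(&cv, &dst, &dstleft, &src, &src_bytes,
	    non_identical_conversions);
	(void)kcodeconv_close(&cv);
	if (error == 0)
		*dst_bytes -= dstleft;	/* report bytes written */
	return error;
}
```

Note that both length arguments are byte counts: the 4-byte ISO 8859-1
string "caf\xe9" becomes 5 bytes in UTF-8, and the caller cannot know
that in advance.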

Another problem with the proposed interface is that codeset_readc()
is not really a conversion function between different codesets, but
a conversion function from the multibyte representation to the wide
character representation of the same codeset (like mbrtowc(3)).

> Jason Thorpe wrote:

> > Is `srclen' an absolute byte (octet) count, or a relative `code'
> > count?  I.e.  for a src consisting of 4 unicode codes (2 octets
> > per code, right?), is it `4' or `8'?

> It is the 'code' count. For your example, it's 4.

Unfortunately, this doesn't work, since the conversion function might
be called from a filesystem-independent code fragment, and that code
fragment doesn't know how many bytes one `code' occupies.
For example, UTF-8 (one representation of Unicode) takes from 1 to 6
bytes for one character, and UTF-16 (another representation of Unicode)
takes 2 or 4 bytes for one character.
The length argument definitely should be a byte count (on a multibyte
interface).
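
To make this concrete, the small sketch below computes how many bytes
the same sequence of code points occupies in each encoding. The function
names are made up for this example, and the UTF-8 lengths follow the
original 1-to-6-byte definition (RFC 2279).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one code point in UTF-8 (RFC 2279: 1 to 6). */
static size_t
utf8_len(uint32_t cp)
{
	if (cp < 0x80)
		return 1;
	if (cp < 0x800)
		return 2;
	if (cp < 0x10000)
		return 3;
	if (cp < 0x200000)
		return 4;
	if (cp < 0x4000000)
		return 5;
	return 6;
}

/* Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair. */
static size_t
utf16_len(uint32_t cp)
{
	return (cp < 0x10000) ? 2 : 4;
}

/* Total byte count of n code points under the given length function. */
static size_t
total_bytes(const uint32_t *cps, size_t n, size_t (*len)(uint32_t))
{
	size_t i, total = 0;

	assert(cps != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += len(cps[i]);
	return total;
}
```

For the four code points U+0041, U+0042, U+3042 and U+10400, the `code'
count is 4 in both cases, but the byte count is 9 in UTF-8 and 10 in
UTF-16. A caller that does not know the encoding can only pass bytes.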

P.S.
I'll think about an interface which can be used as a substitute for
codeset_convc()/codeset_readc(). Are these routines intended to
be used for case-folding comparison in pathname lookup?
--
soda