Subject: code set recoding engine, V2
To: None <tech-kern@netbsd.org>
From: Jaromir Dolecek <dolecek@ics.muni.cz>
List: tech-kern
Date: 11/22/1999 00:53:13
Hi,
after the recent discussion and the feedback I received, I've reworked
the API and removed the unicodisms.  The API is now somewhat
similar to that of iconv(), following Nariyuki Soda's proposal sent in
private mail.  I used the "codeset_" prefix though, as the API is not 100%
the same.  The code is still only able to recode from unicode to a few
code sets (iso-8859-[12], koi8-r, euc-jp), but that's just an
implementation detail :)

Other changes:
* euc-jp translation changed to use sorted hashes; the lookup is about
  3 times slower than with a static mapping table, but the result
  is about 10KB smaller on i386 (about 38KB compiled);
  I'm not sure whether the space saving is worth the performance
  penalty, IMHO not; the program used to generate the hash tables
  is in the subdirectory gen/
* iso-8859-[12]: codes 128-159 are properly mapped, too
* if a code can't be converted to the target code set, it's mangled
  to an ASCII-only code (1-255 was used previously); every code set
  contains ASCII, but we should not assume that codes 128-255 have
  a meaningful value in the target code set (or are displayable)
* since all (supported) code sets contain ASCII, codes <= 127
  are not recoded via code set-specific tables and the value
  is returned unchanged; this means the conversion might be slightly
  faster for typical ASCII-mostly text/filenames
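
To illustrate the last three points, here is a minimal userland sketch of
a single-code conversion: ASCII passes through untouched, other codes are
looked up in a sorted table, and anything unmapped falls back to an ASCII
code. The layout of the real "sorted hashes" is not described above, so the
sorted pair table searched with bsearch(3) is only an assumption, as are the
'?' fallback and all the demo_* names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical table entry: a source code and its target code. */
struct demo_map {
	uint32_t from;	/* source (Unicode) code */
	uint32_t to;	/* code in the target code set */
};

/* Small Unicode -> ISO-8859-2 excerpt, kept sorted by `from'. */
static const struct demo_map demo_table[] = {
	{ 0x00A0, 0xA0 },	/* NO-BREAK SPACE */
	{ 0x00C1, 0xC1 },	/* A WITH ACUTE */
	{ 0x0104, 0xA1 },	/* A WITH OGONEK */
	{ 0x0141, 0xA3 },	/* L WITH STROKE */
	{ 0x02D8, 0xA2 },	/* BREVE */
};

/* Assumed replacement for unconvertible codes; the real mangling may differ. */
#define DEMO_FALLBACK	((uint32_t)'?')

/* bsearch() comparison: key is a uint32_t, element is a table entry. */
static int
demo_cmp(const void *key, const void *elem)
{
	uint32_t k = *(const uint32_t *)key;
	uint32_t e = ((const struct demo_map *)elem)->from;

	return (k > e) - (k < e);
}

uint32_t
demo_convc(uint32_t code)
{
	const struct demo_map *hit;

	/* Every supported code set contains ASCII: no table lookup needed. */
	if (code <= 127)
		return code;

	hit = bsearch(&code, demo_table,
	    sizeof(demo_table) / sizeof(demo_table[0]),
	    sizeof(demo_table[0]), demo_cmp);
	assert(hit == NULL || hit->from == code);

	/* Unmapped codes become an ASCII-only code, never 128-255. */
	return hit != NULL ? hit->to : DEMO_FALLBACK;
}
```

With this table, demo_convc(0x0141) gives 0xA3, while a code with no
entry, such as 0x4E2D, comes back as '?'.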

The proposed API of the code set recoding engine follows. Note that
codeset_t is used instead of the former "const struct codeset_table",
to improve readability and encapsulation.

codeset_t *codeset_open __P((const char *tocode, const char *fromcode));
int codeset_close __P((codeset_t *codeset));

ssize_t codeset_conv __P((codeset_t *codeset, char *dst, size_t dstlen,
				const void *src, size_t srclen));
u_int32_t codeset_convc __P((codeset_t *codeset, u_int32_t code));
u_int32_t codeset_readc __P((codeset_t *codeset, const char *src,
				char const **result));
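
As a rough illustration of the encapsulation point, the following userland
sketch keeps codeset_t opaque to callers and lets the open routine hand back
a pointer to a static, compiled-in structure. The structure layout, the
accepted name strings, the NULL return for unsupported pairs, the 0 return
from close and the demo_ prefix are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All a caller would need to see: an incomplete type used via pointers. */
typedef struct codeset codeset_t;

/* Visible to the implementation only (hypothetical layout). */
struct codeset {
	const char *name;	/* target code set name */
};

/* Compiled-in target code sets; the exact name spellings are assumed. */
static struct codeset demo_sets[] = {
	{ "iso-8859-1" },
	{ "iso-8859-2" },
	{ "koi8-r" },
	{ "euc-jp" },
};

codeset_t *
demo_codeset_open(const char *tocode, const char *fromcode)
{
	size_t i;

	assert(tocode != NULL && fromcode != NULL);

	/* Only "unicode" is supported as the input code set. */
	if (strcmp(fromcode, "unicode") != 0)
		return NULL;

	for (i = 0; i < sizeof(demo_sets) / sizeof(demo_sets[0]); i++)
		if (strcmp(demo_sets[i].name, tocode) == 0)
			return &demo_sets[i];
	return NULL;
}

int
demo_codeset_close(codeset_t *cs)
{
	/* Nothing was allocated by open, so there is nothing to free. */
	(void)cs;
	return 0;
}
```

Because callers only ever hold a codeset_t pointer, the open routine could
later switch to allocating the structure without changing any caller.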

* codeset_open() should prepare the structure used for converting from
  one code set to another. Right now, it just returns a pointer to a
  compiled-in code set structure, but this may change in the future. Note
  that the only input code set currently supported is "unicode".
* codeset_close() frees any resources allocated by codeset_open();
  it does nothing currently
* codeset_conv() converts an array/string of ``srclen'' codes;
  the ``src'' pointer has to point to an array of items of the "right"
  size, for example u_int16_t for unicode, and is implicitly treated as
  such. We can't use strong type checking, because we might want
  to support other-than-unicode code sets on input;
  the function returns the length of the result string, or -1 if an error
  occurs, e.g. if the target buffer is not long enough to hold the result
* codeset_convc() converts a single code
* codeset_readc() reads a code from its encoded string representation;
  if ``result'' is not null, it is updated to point at the position from
  which the next code should be read
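
The calling conventions described for codeset_conv() and codeset_readc()
can be sketched in userland as follows. This is not the proposed
implementation: the demo_* functions are hypothetical stand-ins, the
converter targets ISO-8859-1 only (where Unicode codes up to 0xFF map to
themselves), unconvertible codes become '?', the output is not
NUL-terminated, and the way the EUC-JP reader packs bytes into the returned
code is an assumption. The reader also assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Convert `srclen' Unicode codes to ISO-8859-1 bytes.  `src' is
 * implicitly treated as an array of 16-bit items.  Returns the
 * number of bytes written, or -1 if `dst' is too small.
 */
ssize_t
demo_conv_latin1(char *dst, size_t dstlen, const void *src, size_t srclen)
{
	const uint16_t *in = src;
	size_t i;

	assert(dst != NULL && src != NULL);

	/* One output byte per input code for a single-byte target. */
	if (srclen > dstlen)
		return -1;

	for (i = 0; i < srclen; i++)
		dst[i] = (char)(in[i] <= 0xFF ? in[i] : '?');
	return (ssize_t)srclen;
}

/*
 * Read one code from an EUC-JP encoded string.  The bytes of the
 * sequence are packed big-endian into the return value (assumed).
 * If `result' is not NULL, it is set to the start of the next code.
 */
uint32_t
demo_readc_eucjp(const char *src, const char **result)
{
	const unsigned char *p = (const unsigned char *)src;
	uint32_t code = p[0];
	size_t len = 1, i;

	if (p[0] == 0x8F)
		len = 3;	/* SS3 followed by two bytes */
	else if (p[0] == 0x8E || (p[0] >= 0xA1 && p[0] <= 0xFE))
		len = 2;	/* SS2 plus one byte, or a two-byte pair */

	for (i = 1; i < len; i++)
		code = (code << 8) | p[i];

	if (result != NULL)
		*result = src + len;
	return code;
}
```

A caller would typically size ``dst'' from ``srclen'', check for -1, and
step through an encoded string by feeding the updated ``result'' pointer
back in as the next ``src''.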

Any comments ?

Jaromir
-- 
Jaromir Dolecek <jdolecek@NetBSD.org>      http://www.ics.muni.cz/~dolecek/
@@@@  Wanna a real operating system ? Go and get NetBSD, damn it!  @@@@