tech-kern: codeset recoding engine

Subject: codeset recoding engine
To: None <tech-kern@netbsd.org>
From: Jaromir Dolecek <dolecek@ics.muni.cz>
List: tech-kern
Date: 11/11/1999 23:48:52
Hi,
I've pulled the code together so that it is, IMHO, suitable for
addition to the tree. The code is targeted at kernel use and is
currently able to recode Unicode characters to iso-8859-1, iso-8859-2,
koi8-r & eucjp. Translation to utf-8 encoding is also supported. The engine
would primarily be used by ntfs and cd9660-joliet, with msdosfs
possibly to come later. Most likely, the filesystems will
accept the desired codeset as a mount option.

The engine is based on code written by Motomichi Matsuzaki, originally
as a set of patches for FreeBSD to support Unicode names for
Joliet-style CDs. I have hacked on the code heavily, though; in
particular, the interface to the engine is totally reworked and the
code is actually commented :)

Note that the previously proposed "no conversion" mode is not currently
supported, as it doesn't make sense for Unicode (i.e. the standard
UCS-2 form is just plain unusable; even the most standard NetBSD
utilities would be driven mad by it).

The interface to the recoding engine consists of four basic functions:

const struct codeset_table *get_codeset __P((const char *));
u_int32_t unicode_convert __P((const struct codeset_table *, unicode_t));
ssize_t unicode_convert_string __P((const struct codeset_table *,
                        const unicode_t *, size_t, char *, size_t));
u_int32_t codeset_getrune __P((const struct codeset_table *, const char *,
                        char const **));

get_codeset() returns a pointer to the structure used for recoding;
the pointer is later passed to unicode_convert(), unicode_convert_string()
and codeset_getrune().

unicode_convert() converts a single Unicode character to its
representation in the target codeset.

unicode_convert_string() recodes a string of Unicode characters
into a string of target characters. The appropriate encoding is applied,
so that e.g. eucjp is encoded as euc and plain Unicode as utf-8.
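
To illustrate the utf-8 case, here is a minimal sketch of how a string
of 16-bit Unicode values could be written out as utf-8. This is not the
code from the tarball: the names sketch_ucs2_to_utf8() and
sketch_ucs2_string_to_utf8() are made up for this example, the return
conventions are assumptions, and surrogate pairs are deliberately not
handled.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the posted code): encode one 16-bit
 * Unicode value as utf-8. Returns the number of bytes written (1 to 3),
 * or 0 if the output buffer is too small.
 */
static size_t
sketch_ucs2_to_utf8(uint16_t uc, char *out, size_t outlen)
{
	if (uc < 0x80) {			/* 0xxxxxxx */
		if (outlen < 1)
			return 0;
		out[0] = (char)uc;
		return 1;
	}
	if (uc < 0x800) {			/* 110xxxxx 10xxxxxx */
		if (outlen < 2)
			return 0;
		out[0] = (char)(0xC0 | (uc >> 6));
		out[1] = (char)(0x80 | (uc & 0x3F));
		return 2;
	}
	if (outlen < 3)				/* 1110xxxx 10xxxxxx 10xxxxxx */
		return 0;
	out[0] = (char)(0xE0 | (uc >> 12));
	out[1] = (char)(0x80 | ((uc >> 6) & 0x3F));
	out[2] = (char)(0x80 | (uc & 0x3F));
	return 3;
}

/*
 * Hypothetical string variant: recode srclen values into dst and
 * NUL-terminate the result. Returns the number of bytes written (not
 * counting the NUL), or -1 if dst is too small. The return convention
 * is an assumption made for this sketch.
 */
static long
sketch_ucs2_string_to_utf8(const uint16_t *src, size_t srclen,
    char *dst, size_t dstlen)
{
	size_t used = 0;
	size_t i, n;

	for (i = 0; i < srclen; i++) {
		n = sketch_ucs2_to_utf8(src[i], dst + used, dstlen - used);
		if (n == 0)
			return -1;
		used += n;
	}
	if (used >= dstlen)			/* no room for the NUL */
		return -1;
	dst[used] = '\0';
	return (long)used;
}
```

With this sketch, the three values 0x0041, 0x00E9 and 0x20AC come out
as six bytes: 41, C3 A9, E2 82 AC.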

codeset_getrune() reads one encoded rune off a (char *) string.
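
A rough sketch of what reading a rune off a string can look like, for
illustration only. It assumes that the last argument receives the
position just past the rune, and it uses a simplified two-byte EUC rule
(two consecutive bytes with the high bit set form one rune). The name
sketch_getrune() and both of these assumptions are mine for this
example; the real function also takes the codeset table and may behave
differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: read one rune from a NUL-terminated string.
 * A byte with the high bit set, followed by another such byte, is
 * treated as a two-byte rune; anything else is a one-byte rune.
 * At the end of the string, 0 is returned and *next stays at the NUL.
 */
static uint32_t
sketch_getrune(const char *s, const char **next)
{
	const unsigned char *p = (const unsigned char *)s;
	uint32_t rune = p[0];
	size_t len = (p[0] != 0) ? 1 : 0;

	/* p[1] is safe to read here: p[0] is non-zero, so the
	 * terminating NUL is at p[1] or later. */
	if ((p[0] & 0x80) != 0 && (p[1] & 0x80) != 0) {
		rune = (rune << 8) | p[1];
		len = 2;
	}
	if (next != NULL)
		*next = s + len;
	return rune;
}
```

A caller would loop, passing the returned position back in, until the
function returns 0.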

Note that recoding from the user codeset to Unicode is currently not
supported. If msdosfs ever uses the engine for converting
long filename entries, this will need to be implemented.

The engine tries hard to DTRT with Unicode characters which do not
map into the target codeset. In that case, the character's lower
8 bits are used to form an arbitrary 8-bit replacement character;
non-null Unicode characters whose lower 8 bits are zero
are mapped to '?'. This is "legal" to do, as codes 0-255 are part
of every currently supported codeset.
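
The fallback rule can be written down in a few lines. This is only a
sketch of the rule as described here, not the actual code; the name
sketch_fallback_char() is made up for the example, and it covers only
the replacement step, i.e. it would be called once the lookup in the
target codeset has already failed.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the replacement rule for a Unicode character
 * that has no mapping in the target codeset: use the lower 8 bits,
 * except that a non-null character whose lower 8 bits are zero
 * becomes '?' (so no stray NUL ends up in the middle of a name).
 */
static unsigned char
sketch_fallback_char(uint16_t uc)
{
	unsigned char low = (unsigned char)(uc & 0xFF);

	if (uc != 0 && low == 0)
		return '?';
	return low;
}
```

For example, 0x0141 falls back to 0x41 ('A'), while 0x0100 falls back
to '?'.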

Any comments, recommendations or flames are welcome. Note that
this code is not intended to be a universal codeset
recoder; it merely serves for recoding between Unicode and
a user-defined codeset. Possible extensions and ideas for future work
were already discussed here a while ago. I'm aware this solution
is not ideal, but it satisfies my need for a Unicode filename recoder
and it should be usable by other people, too.

The codeset code is available for review at
	http://www.ics.muni.cz/~dolecek/NetBSD/codeset.tar.gz

If no serious objections are raised shortly, I'm going to
commit the codeset engine as well as the appropriate changes to
cd9660 & ntfs during this weekend or early next week.

One remaining issue - I'm not a native English speaker, so I don't know
whether the term "codeset" or "charset" should be used. "Charset"
is used by virtually everyone else, but "codeset" seems to be slightly
better, as we are working with sets of codes (i.e. the numbers
used to represent characters) and not the actual character sets.
I'm not attached to using "codeset" though; I'd like to use whatever
is grammatically & semantically OK.
Opinions?

Jaromir
-- 
Jaromir Dolecek <jdolecek@NetBSD.org>      http://www.ics.muni.cz/~dolecek/
"The only way to get rid temptation is to yield to it." -- Oscar Wilde