Subject: Re: codeset recoding engine
To: Erik Bertelsen <erik@mediator.uni-c.dk>
From: Jaromir Dolecek <dolecek@ics.muni.cz>
List: tech-kern
Date: 11/14/1999 12:23:09

Erik Bertelsen wrote:
> Please be careful about the terminology: In my understanding,  UTF-8 is -not-
> a character code (character set), but an encoding of multibyte characters into
> a sequence of bytes that are safely transmittable over a pure 7-bit ASCII
> channel.
> 
> UTF-8 may be used to encode characters in several character codes (sets), e.g.
> LATIN-1 and UNICODE. Note that even for LATIN-1, UTF-8 is not
> the identity mapping.
> I also think (but am not 100% sure) that UTF-8 is able to encode
> full ISO 10646 characters if needed.

The catch here is that when an application converts the UTF-8 back
to its previous representation, the result could well be worthless: an
application needs the kernel to pass it all filenames in one well-defined
codeset. Otherwise, the app would just get some arbitrary 2-byte or 4-byte
code, and that doesn't help.
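
To illustrate why a single well-defined codeset matters, here is a minimal
Python sketch. The filename bytes and the choice of ISO-8859-2 and
ISO-8859-1 are assumptions made for the example, not anything taken from
the kernel: the same raw bytes name different characters depending on which
codeset the reader assumes.

```python
# Hypothetical raw filename bytes, as a filesystem might store them.
raw_name = b"\xe8aj.txt"

# The same bytes interpreted under two different assumed codesets.
as_latin2 = raw_name.decode("iso8859_2")  # 0xE8 maps to U+010D (c with caron)
as_latin1 = raw_name.decode("iso8859_1")  # 0xE8 maps to U+00E8 (e with grave)

# The two readings disagree, so the bytes alone do not identify the name.
print(as_latin2 == as_latin1)
```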

UTF-8 is just a transport encoding. You have to recode the original
codeset to some "universal" one prior to converting the result
to UTF-8. If that's not possible, we lose.
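
The two-step recoding can be sketched roughly as follows. This is only an
illustration in Python, assuming the original codeset is known and has a
mapping to Unicode; `recode_to_utf8` is a hypothetical helper name, not an
existing kernel interface.

```python
def recode_to_utf8(raw: bytes, source_codeset: str) -> bytes:
    """Recode bytes from a known source codeset to UTF-8 (illustrative only)."""
    # Step 1: map the original codeset to "universal" code points.
    # This raises UnicodeDecodeError if a byte has no mapping in the
    # source codeset, which is the case where we lose.
    text = raw.decode(source_codeset)
    # Step 2: serialize those code points with the UTF-8 transport encoding.
    return text.encode("utf-8")


# 0xE8 is U+010D in ISO-8859-2 but U+00E8 in ISO-8859-1, so the UTF-8
# result depends on knowing the source codeset.
print(recode_to_utf8(b"\xe8", "iso8859_2").hex())  # c48d
print(recode_to_utf8(b"\xe8", "iso8859_1").hex())  # c3a8
```

Step 2 is mechanical; all of the knowledge about the original codeset
lives in step 1.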

Itojun: could you give me an example of a codeset which is not handled
by Unicode (2-byte codeset) and/or ISO-10646 (4-byte codeset)?

Jaromir
-- 
Jaromir Dolecek <jdolecek@NetBSD.org>      http://www.ics.muni.cz/~dolecek/
"It's IMPOSSIBLE to overcomment any code. It can only be undercommented."
@@@@  Wanna a real operating system ? Go and get NetBSD, damn it!  @@@@