Subject: Re: codeset recoding engine
To: Erik Bertelsen <erik@mediator.uni-c.dk>
From: None <itojun@iijlab.net>
List: tech-kern
Date: 11/14/1999 17:56:22
	I did not have enough coffee.  Let me rephrase.

>Please be careful about the terminology: In my understanding,  UTF-8 is -not-
>a character code (character set), but an encoding of multibyte characters into
>a sequence of bytes that are safely transmittable over a pure 7-bit ASCII
>channel.
>
>UTF-8 may be used to encode characters in several character codes (sets), e.g.
>LATIN-1 and UNICODE. Note that even for LATIN-1, UTF-8 is not the identity mapping.

	This statement is not really true.  You seem to assume UCS-4 here.
	(Note that Latin-1 has a 1-to-1 mapping onto the first 256 code
	points of UCS-2 or UCS-4.)
	If this observation is wrong, correct me...
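	To make the distinction concrete, here is a minimal sketch (the
	function names are made up for illustration): mapping a Latin-1
	byte to a UCS code point is the identity, while the UTF-8 byte
	sequence for that code point is one or two bytes long.

```c
#include <stddef.h>
#include <stdint.h>

/* Latin-1 byte -> UCS code point: the value is unchanged (0x00..0xFF). */
static uint32_t latin1_to_ucs(unsigned char c)
{
	return (uint32_t)c;
}

/*
 * Encode a code point below 0x800 as UTF-8.  This range covers all of
 * Latin-1.  Returns the number of bytes written (1 or 2), or 0 if the
 * code point is outside the range handled by this sketch.
 */
static size_t ucs_to_utf8_short(uint32_t cp, unsigned char out[2])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;		/* ASCII: one byte */
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));	/* lead byte */
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));	/* trail byte */
		return 2;
	}
	return 0;
}
```

	For example, Latin-1 0xE9 (e with acute accent) is code point
	U+00E9, and its UTF-8 form is the two bytes 0xC3 0xA9.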

>I also think (but am not 100% sure) that UTF-8 is able to encode full ISO 10646
>characters if needed.

	In the above, you say that you are going to assume UCS-4 (or it
	seems so).  Please don't ever, ever hardcode anything to UCS-2 (ISO
	10646) or UCS-4.
	There are character sets that contain characters that cannot be
	converted into characters in UCS-2 or UCS-4.  Hence, you can't
	put those characters into a UTF-8 stream.

> >> > I think you need two conversons:
> >> > kernel: filesystem-charset to utf-8
> >> > then
> >> > userland: utf-8 to LC_CHARSET.

	The above two-step conversion assumes the following:
	- every character in any character set can be converted into UCS-4.
		In other words, you are assuming that there will be no
		information loss in the "filesystem-charset -> utf-8" conversion.
	- the locale library uses UCS-4 as the internal encoding for wchar_t
	  (or, every internal encoding for rune_t in the BSD runelocale
	  library is UCS-4).
	Neither of these two assumptions holds.
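	A rough sketch of the first problem, using a made-up four-character
	"toy" charset (the table, names and the use of U+FFFD as a
	substitute are all assumptions for illustration, not any real
	filesystem charset): once a character has no UCS equivalent, the
	first step of the pivot has to drop or substitute it, and the
	second step can never get it back.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING	UINT32_MAX	/* marks "no UCS equivalent" */
#define UCS_REPLACEMENT	0xFFFD		/* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical toy charset: code 3 has no counterpart in UCS. */
static const uint32_t toy_to_ucs[4] = {
	0x0041, 0x00E9, 0x3042, NO_MAPPING
};

/* Step 1 for one character: 0 on success, -1 if it cannot be mapped. */
static int toy_char_to_ucs(unsigned code, uint32_t *cp)
{
	if (code >= 4 || toy_to_ucs[code] == NO_MAPPING)
		return -1;
	*cp = toy_to_ucs[code];
	return 0;
}

/*
 * Step 1 for a string.  Unmappable characters are replaced, and
 * *lost counts how many original characters can no longer be recovered.
 * Returns the number of code points written to out.
 */
static size_t toy_to_ucs_string(const unsigned char *in, size_t n,
    uint32_t *out, size_t *lost)
{
	size_t i, w = 0;

	*lost = 0;
	for (i = 0; i < n; i++) {
		uint32_t cp;

		if (toy_char_to_ucs(in[i], &cp) == 0) {
			out[w++] = cp;
		} else {
			out[w++] = UCS_REPLACEMENT;
			(*lost)++;
		}
	}
	return w;
}
```

	Whatever the userland side does afterwards, it only ever sees the
	substitute, so a filename containing toy code 3 cannot be
	round-tripped through the pivot.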

	Also, there's no good way for the runelocale library to handle
	characters outside of what LC_CHARSET is capable of handling (for
	example, if you mount a Chinese filesystem while your LC_CHARSET is
	for Japanese, you will be in big trouble).
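	As a small illustration of that last point, assuming a C library
	that provides the standard wcrtomb() (the helper name below is
	made up): the most userland can portably do is ask whether a
	wide character can be encoded in the current locale's charset.
	What a given wchar_t value means is itself up to the
	implementation, which is exactly the problem.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/*
 * Returns 1 if wc can be encoded in the charset of the current
 * LC_CTYPE locale, 0 if wcrtomb() rejects it.
 */
static int representable_in_locale(wchar_t wc)
{
	char buf[MB_LEN_MAX];
	mbstate_t st;

	memset(&st, 0, sizeof(st));	/* start in the initial shift state */
	return wcrtomb(buf, wc, &st) != (size_t)-1;
}
```

	When this returns 0 there is nothing sensible left to do with the
	character except drop it or substitute something for it.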

itojun