Subject: Re: isprint()
To: None <perry@piermont.com>
From: Terry Moore <tmm@mcci.com>
List: current-users
Date: 08/25/1996 00:49:19
> UNICODE in its UTF encoding is sort of ASCII compatible, but the extra
> characters it supports frequently come in upper and lower case
> varieties.

UTF8 is completely ASCII compatible, and was originally devised by
Ken Thompson to allow Unicode to be used in UNIX w/o switching the
kernel to 16-bit characters.  See:

	http://www.stonehand.com/unicode/standard/fss-utf.html

for Thompson's original proposal, which is quite clear, and

	http://www.stonehand.com/unicode/standard/wg2n1036.html

for the final definition of UTF8, which is a standards document.
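
To make the ASCII compatibility concrete, here is a minimal sketch of the
1-to-6 byte encoding scheme (not code from either document; the function
name utf8_encode is made up for illustration).  Values below 0x80 come out
as the same single byte, and every byte of a longer sequence has its high
bit set, so it can never be mistaken for an ASCII character:

```python
def utf8_encode(cp):
    """Encode one character value (0..0x7FFFFFFF) as a UTF8 byte sequence.

    Illustrative sketch only; utf8_encode is a hypothetical helper name.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value out of range for UTF8")
    if cp < 0x80:
        # ASCII range: one byte, identical to the ASCII code.
        return bytes([cp])
    # (largest value, lead-byte marker) for 2-, 3-, 4-, 5- and 6-byte forms.
    forms = [(0x7FF, 0xC0), (0xFFFF, 0xE0), (0x1FFFFF, 0xF0),
             (0x3FFFFFF, 0xF8), (0x7FFFFFFF, 0xFC)]
    for n_trail, (limit, lead) in enumerate(forms, start=1):
        if cp <= limit:
            break
    out = []
    for _ in range(n_trail):
        # Each trailing byte carries 6 bits and is tagged 10xxxxxx.
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain fit in the lead byte next to its marker.
    out.append(lead | cp)
    return bytes(reversed(out))


if __name__ == "__main__":
    for value in (0x41, 0xE9, 0x20AC, 0x10000):
        print(hex(value), utf8_encode(value).hex())
```

For values that a modern library also accepts, the result should agree
with the library's own encoder, e.g. utf8_encode(0xE9) against
"\u00e9".encode("utf-8").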

> Unicode is 16 bits, not 32 -- UTF is an encoding of UNICODE that
> allows it to be "mostly" compatible with ASCII -- that is, ASCII files
> are valid UTF but not always vice versa...

UTF8 encodes ISO 10646 (32-bit) characters. ISO 10646 is apparently
limited (by the spec) to approximately 2^20 distinct characters.  UTF16
is defined as a way to map the characters that are outside the low
2^16 range into the UNICODE range.  The mapping is defined to be
unambiguous: each bigger character becomes a pair of 16-bit codes, with
one character range carrying the most significant 10 bits and a second,
different range carrying the least significant 10 bits.  Both ranges
are selected from the formerly reserved range of UNICODE.

See:
	http://www.stonehand.com/unicode/standard/utf16.html
and
	http://www.stonehand.com/unicode/standard/wg2n1035.html

for more info.
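
As a rough sketch of that mapping (the helper names to_surrogates and
from_surrogates are made up here; the range bases 0xD800 and 0xDC00 are
the ones UTF16 uses), a character above 0xFFFF has 0x10000 subtracted
and the remaining 20 bits are split into two 10-bit halves:

```python
HIGH_BASE = 0xD800  # start of the range carrying the top 10 bits
LOW_BASE = 0xDC00   # start of the range carrying the bottom 10 bits


def to_surrogates(cp):
    """Split a character in 0x10000..0x10FFFF into a pair of 16-bit codes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("character does not need (or cannot use) a pair")
    v = cp - 0x10000          # 20-bit offset above the low 2^16 range
    return HIGH_BASE + (v >> 10), LOW_BASE + (v & 0x3FF)


def from_surrogates(high, low):
    """Recombine a pair produced by to_surrogates()."""
    if not (HIGH_BASE <= high < LOW_BASE and LOW_BASE <= low < 0xE000):
        raise ValueError("not a valid high/low pair")
    return 0x10000 + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)


if __name__ == "__main__":
    hi, lo = to_surrogates(0x1D11E)
    print(hex(hi), hex(lo), hex(from_surrogates(hi, lo)))
```

Because the two ranges do not overlap each other or any ordinary
character, a 16-bit code taken in isolation can always be classified as
a first half, a second half, or a complete character.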

> UNICODE has excellent support for Chinese characters, although there
> are some complaints about the Han unification that was done. UNICODE's
> set is actually a superset of all the national sets in the far east,
> though, but with a differing order.

The UNICODE consortium has adopted ISO 10646, partly because (regardless
of how one feels about Han unification) 65536 characters aren't enough.
As the spec says, one million characters is probably enough for now; or,
to quote directly from the above URL from the UNICODE website:

  "This means that there will be 14*65536 = 917504 code positions for new 
  standardized characters and 131072 additional private use code positions. 
  Given these numbers, it is rather unlikely than any other portion of the
  UCS-4 encoding space will be employed."
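
The figures in that quote can be checked with a few lines of arithmetic.
This is only a sanity check, reading 131072 as two blocks of 65536
positions (an interpretation of the numbers, not something the quote
spells out):

```python
BLOCK = 1 << 16                        # 65536 code positions per block

# Two 10-bit halves give 2^10 * 2^10 pairs reachable through UTF16.
pair_space = (1 << 10) * (1 << 10)

standardized = 14 * BLOCK              # "14*65536 = 917504"
private_use = 2 * BLOCK                # "131072 additional private use"

if __name__ == "__main__":
    print(standardized, private_use, standardized + private_use, pair_space)
```

The two figures add up to exactly 2^20, which is where the "one million
characters" comes from.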

Best regards,
Terry Moore
tmm@mcci.com	tel: +1-607-277-1029	fax: +1-607-277-6844