tech-kern: Re: Unicode support in iso9660.

Subject: Re: Unicode support in iso9660.
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
From: Christopher Vance <christopher@nu.org>
List: tech-kern
Date: 11/24/2004 00:31:04

On Tue, Nov 23, 2004 at 07:41:30AM -0500, der Mouse wrote:
>> It would handle it fine, yes.  I thus think that UTF-8 (which supports
>> chars of up to 32+ bits) would be fine for this.
>
>I've seen this said before in this thread (that UTF-8 supports 32-bit
>characters).
>
>RFC 3629, the latest UTF-8 spec I find, disagrees; it does not support
>anything above 10ffff (though it could go as high as 1fffff without too
>much trouble, if I'm reading it right, and could be extended to greater
>widths in a tolerably obvious way).

The Unicode Consortium defines code points from 0 to 10ffff (that
needs 21 bits, and does not even use all of the 21-bit range).  ISO
10646 in theory allows 31 bits (so as not to be negative if using a
32 bit container) but the maintainers have promised not to allocate
characters beyond the Unicode maximum.
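
To make the bit counts concrete, here is a quick check in Python.  The
constant names are mine, chosen for illustration, and do not come from
either standard:

```python
# Illustrative constants; the names are assumptions, not spec terms.
UNICODE_MAX = 0x10FFFF        # highest code point Unicode defines
ISO_10646_MAX = 0x7FFFFFFF    # 31-bit ceiling of the original ISO 10646 range

print(UNICODE_MAX.bit_length())    # 21
print(ISO_10646_MAX.bit_length())  # 31

# 0..10ffff is 17 planes of 65536 code points each,
# which is well short of the full 21-bit range.
print(UNICODE_MAX + 1 == 17 * 65536)  # True
print(UNICODE_MAX + 1 < 2 ** 21)      # True
```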

IOW, UTF-8 as originally defined would have handled larger numbers
than are now necessary.
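
As a rough sketch of what that means in practice (written from memory
of the bit layout, not copied from any spec text; the function and
constant names are made up for illustration), the original scheme
used up to six bytes, while anything up to 10ffff fits in four:

```python
# Sketch only: encode_original_utf8 is a hypothetical helper, not a
# standard library function.  It does not reject surrogates.

ORIGINAL_MAX = 0x7FFFFFFF   # 31 bits, the ceiling of the original scheme

# (upper limit, lead-byte marker, number of continuation bytes)
_RANGES = [
    (0x7F,       0x00, 0),
    (0x7FF,      0xC0, 1),
    (0xFFFF,     0xE0, 2),
    (0x1FFFFF,   0xF0, 3),
    (0x3FFFFFF,  0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]

def encode_original_utf8(value):
    """Encode an integer using the original 1-to-6-byte UTF-8 layout."""
    if not 0 <= value <= ORIGINAL_MAX:
        raise ValueError("value does not fit in 31 bits")
    for limit, marker, ncont in _RANGES:
        if value <= limit:
            break
    out = bytearray()
    # Each continuation byte carries 6 bits; build from the low end.
    for _ in range(ncont):
        out.insert(0, 0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the lead byte under its marker.
    out.insert(0, marker | value)
    return bytes(out)

print(len(encode_original_utf8(0x10FFFF)))     # 4
print(len(encode_original_utf8(0x7FFFFFFF)))   # 6
```

For values up to 10ffff (surrogates aside) this should agree with
Python's own UTF-8 encoder; the five- and six-byte forms are the part
that RFC 3629 leaves out.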

The original definition is somewhere in the Plan 9 documentation,
where it was called FSS-UTF or UTF-FSS, or something like that.  The
official definition is in one of the appendices to ISO 10646.  I'm not
sure whether the Unicode people cut their definition of UTF-8 back to
the number of bytes necessary for 21 bits, or kept theirs the same as
the ISO definition.  The RFC is presumably an attempt to adapt that
into something more accessible.

My copy of 10646 is buried in a box, and I'd have to step over a
sleeping dog to find my copy of the Unicode book.  I haven't looked at
the RFC.

-- 
Christopher Vance