tech-userlevel archive


Re: UTF8 (Was: [PATCH] replace 0xA0 to whitespace in plain text files (part 2))



> ...which brings up the question if NetBSD shouldn't go UTF8 anyway?

Depends.  Go to UTF-8 for what?

One of the biggest problems is that in a whole lot of places NetBSD -
and Unix more generally - has traditionally used byte streams rather
than character streams.  Characters are converted to octets very early
on input and converted back very late on output; very nearly everything
in between works with octet streams, not character streams.
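
To make the octet/character distinction concrete, here is a minimal
sketch in C.  The function name is made up for illustration, and the
"this is UTF-8" assumption is supplied by the caller; nothing in the
octets themselves says so, which is rather the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count characters in n octets, ASSUMING they are well-formed UTF-8.
 * Continuation octets look like 10xxxxxx; every other octet starts a
 * new character.  (Hypothetical helper, not an existing libc call.)
 */
static size_t utf8_char_count(const uint8_t *s, size_t n)
{
    size_t chars = 0;

    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```

Anything that only sees the octet stream knows n and nothing else;
the character count exists only once someone has picked an encoding.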

Even things like the C compiler don't really work with characters.  The
text "if" in a source file is not really "character i" "character f";
rather, it's "octet 0x69" "octet 0x66" (well, typically - that's
assuming ASCII was used when the compiler was built) - as it has to be,
since there is no way to declare what character set the input uses.
(Part of the problem is C proper, actually, since it conflates the
notions of "character" and "small integer".  This can be fixed by, for
example, being very strict with the distinction between "char" and
"int8_t"/"uint8_t" (and effectively doing away with "unsigned char",
since it makes no sense except when considering "char" to actually mean
"octet-sized integer type" rather than "character"), but without making
those into truly separate types it's very hard to get that wholly
right.)
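
A rough sketch of what keeping "octet" apart from "char" might look
like in today's C.  The typedef and both function names are
hypothetical, and since uint8_t is usually just another name for
unsigned char, the compiler will not actually enforce the separation;
this only shows the intent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t octet;  /* hypothetical name: an octet-sized integer */

/*
 * Does the buffer begin with the octets 0x69 0x66?  Written with
 * numbers rather than 'i' and 'f' so that the answer does not depend
 * on the character set the compiler was built with.
 */
static int starts_with_if_octets(const octet *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    return n >= 2 && buf[0] == 0x69 && buf[1] == 0x66;
}

/*
 * Convert a plain char to its octet value.  Whether plain char is
 * signed is implementation-defined, so 0xA0 stored in a char may read
 * back as a negative number; going through unsigned char avoids that.
 */
static octet octet_of_char(char c)
{
    return (octet)(unsigned char)c;
}
```

On an ASCII-based build, starts_with_if_octets applied to the string
literal "if (x)" is true, but only because the compiler happened to
turn 'i' and 'f' into those two octets.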

The closest NetBSD gets to having a character set at present is that a
lot of the shipped octet sequences don't make sense unless interpreted
using a character set which has ASCII as a subset.  (Some, notably
executables, don't make sense when interpreted as characters at all,
regardless of the character set in use.)

Changing this would be a lot more invasive than it appears, involving a
fundamental philosophical shift in the whole system.  Some files are
unavoidably not files of characters (such as the executables mentioned
above), so it's not as simple as saying "everything is UTF-8" (or
8859-1 or whatever).  Without converting everything that handles octets
(which means very nearly the entire system) to carry character-set
information along with those octets, this is not truly fixable.
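
As a sketch of what "carrying character-set information along with
the octets" could mean, assuming a made-up struct and function (no
existing NetBSD interface looks like this):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: octets plus the character set they are encoded in. */
struct tagged_text {
    const char    *charset;  /* e.g. "UTF-8"; NULL means "not text" */
    const uint8_t *octets;
    size_t         len;
};

/*
 * Returns 1 if equal, 0 if different, -1 if it cannot tell without
 * transcoding (different charsets) or because one side is not text.
 * Charset names are compared literally, and issues such as Unicode
 * normalization are ignored in this sketch.
 */
static int tagged_text_equal(const struct tagged_text *a,
                             const struct tagged_text *b)
{
    assert(a != NULL && b != NULL);
    if (a->charset == NULL || b->charset == NULL)
        return -1;
    if (strcmp(a->charset, b->charset) != 0)
        return -1;
    if (a->len != b->len)
        return 0;
    return a->len == 0 || memcmp(a->octets, b->octets, a->len) == 0;
}
```

The octet 0xE9 tagged ISO-8859-1 and the octets 0xC3 0xA9 tagged
UTF-8 are the same character, but a bare memcmp() cannot know that;
every interface that passes text around would need the extra field.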

I became very aware of this when I tried to implement SSH.  As it
stands the spec is basically unimplementable on anything Unixy, because
it specifies a character set for many pieces of data on the wire
(usernames are one simple example).  But the implementation doesn't
have character sequences for, say, usernames; all it has is octet
sequences!  (I punted, ignoring the issue except for documenting it.)
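
About the most an implementation can do is check whether the octets
it has happen to be well-formed in the wire encoding.  A minimal
sketch, assuming that encoding is UTF-8 (which is what RFC 4252 gives
for user names); the function name is invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if s[0..n-1] is well-formed UTF-8 per RFC 3629, else 0.
 * Rejects stray continuation octets, truncated sequences, overlong
 * forms, surrogates, and code points above U+10FFFF.
 */
static int is_well_formed_utf8(const uint8_t *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        uint8_t b = s[i];
        size_t need;        /* continuation octets expected */
        uint32_t cp, min;   /* code point, smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;

        if (n - i - 1 < need)
            return 0;
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}
```

This only says the octets could be UTF-8.  A username stored as the
single octet 0xE9 fails the check, but one stored in some other
encoding may pass by accident, and the system has no record either
way of what encoding was meant.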

/~\ The ASCII                           der Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

