Subject: utf-8 and userland
To: None <>
From: Wolfgang S. Rupprecht <>
List: tech-userlevel
Date: 03/12/2004 10:44:04
I noticed the new XFree86 has a utf-8 aware xterm (when invoked via
"uxterm").  The cool thing is that utf-8 is essentially a 32-bit
character set with a variable-length encoding.  One gets access to
mathematical symbols and all the national encodings without switching
character-encoding modes.  The encoding is carefully chosen so that no
NUL bytes appear anywhere in the data stream (except for the NUL
character itself), and the low 128 "us-ascii" chars are all where
pre-existing programs expect them to be.  In short, it is very ascii
compatible, and unless the program is displaying something on the
screen, it can usually be blissfully unaware of the fact that the
encoding is slightly weird.
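
To make the byte layout concrete, here is a rough sketch in Python (my
own illustration, not code from any NetBSD or XFree86 source; the
helper name utf8_encode is hypothetical).  It assumes the current
definition of UTF-8 (RFC 3629), which stops at U+10FFFF and so needs
at most four bytes per character.  It hand-rolls the encoding and
checks the result against Python's built-in encoder:

```python
def utf8_encode(cp):
    """Encode one Unicode code point (0..0x10FFFF) as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                      # 7 bits: plain us-ascii, one byte
        return bytes([cp])
    if cp < 0x800:                     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 16 bits: 1110xxxx + two 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 21 bits: 11110xxx + three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# 'A', e-acute, the summation sign, and a musical G clef
for ch in ("A", "\u00e9", "\u2211", "\U0001d11e"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # agrees with the built-in codec
    print("U+%04X -> %s" % (ord(ch), " ".join("%02x" % b for b in encoded)))
```

Every byte of a multi-byte sequence has its high bit set, which is why
a NUL byte, a '/', or any other ascii byte can never show up in the
middle of a non-ascii character.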

A cool demo is to grab this file and then "cat" it from inside uxterm.
If anyone needs convincing that utf-8 is a *good thing* this should do it.

Backgrounder/FAQ for UTF-8:

Now for the NetBSD content.  UTF-8 is designed so that it should have
no impact on most programs that touch utf-8 content unless they are
themselves drawing the screen content or arranging for screen output
to be nicely justified.  The impact on the kernel appears to be nil.  I
have been able to create files with utf-8 filenames and read from and
write to them with no problem.  Things "just work" from 8-bit clean

The fly in the ointment is that some programs mash the high bits or
otherwise censor certain bytes.  Most notably ls(1) has a routine
called safe_print() that is anything but safe for UTF-8.  Is this just
a hold-over that can be switched off (or at least turned down a bit)
when the locale (LC_ALL, LC_CTYPE or LANG) specifies UTF-8?  I'm
willing to submit patches if it will move things along.  I just don't
want to bother if nobody wants it.
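
For discussion, here is a minimal sketch of the kind of filtering I
have in mind, written in Python for brevity.  It is not the existing
safe_print() and not a proposed patch; the function name safe_name is
hypothetical.  The idea is to decide printability per decoded
character instead of per byte, so valid UTF-8 passes through while
control characters and malformed bytes still get replaced by '?'.  A
C version for ls(1) would presumably do the same thing with
mbrtowc(3) and iswprint(3).

```python
def safe_name(raw):
    """Return the filename bytes with unprintable characters and
    malformed UTF-8 bytes replaced by '?'."""
    # surrogateescape turns each undecodable byte into a lone surrogate
    # instead of raising, so malformed input can be handled below.
    text = raw.decode("utf-8", errors="surrogateescape")
    # str.isprintable() is False for control characters and for the
    # surrogates produced above, so both end up as '?'.
    cleaned = "".join(ch if ch.isprintable() else "?" for ch in text)
    return cleaned.encode("utf-8")

print(safe_name(b"plain.txt"))
print(safe_name("r\u00e9sum\u00e9.txt".encode("utf-8")))
print(safe_name(b"evil\x1b[31m.txt"))
print(safe_name(b"bad\xffname"))
```

With this approach the escape character and the stray 0xff byte are
still censored, but the accented name comes out byte-for-byte
unchanged.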

Wolfgang S. Rupprecht