Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-userlevel
Date: 03/12/2004 14:20:26
> If anyone needs convincing that utf-8 is a *good thing* this should
> do it.

I'd rather just use 32-bit chars.  But that's probably just me.

> Now for the netbsd content.  UTF-8 is designed so that it should have
> no impact on most programs that touch utf-8 content unless they are
> themselves drawing the screen content or arranging that screen output
> is nicely justified.  [...file names...]  Things "just work" from
> 8-bit clean programs.

Not when trying to interoperate with output from other 8-bit-clean
programs that use non-UTF-8 encodings (e.g., 8859-1).  What will your
UTF-8-aware ls-and-uxterm do with a file named "École" created by an
8859-1 user program?  (Mangle it, almost certainly, since the octet
0xc9 looks like the first octet of a two-octet UTF-8 sequence but the
following octet, 0x63, is not a valid second octet for such a sequence.
"׫foo»×" will get mangled too, but differently, and the ׫
fundamentally differently from the »×: 0xd7 0xab happens to be a
well-formed two-octet sequence, while 0xbb 0xd7 is a stray continuation
octet followed by a lead octet with nothing after it.)
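
The two cases above can be checked with a small Python sketch.  This is
only an illustration, not what ls or uxterm actually do; the helper
name utf8_problem is made up here, and the octet strings are the
8859-1 encodings of the two example names.

```python
def utf8_problem(octets: bytes):
    """Return (offset, reason) for the first UTF-8 decoding error,
    or None if the octets are well-formed UTF-8."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError as err:
        return (err.start, err.reason)
    return None


# 8859-1 encodings of the two file names from the text (hypothetical files).
ecole = b"\xc9cole"              # c9 63 6f 6c 65
quoted = b"\xd7\xabfoo\xbb\xd7"  # d7 ab 66 6f 6f bb d7

# 0xc9 announces a two-octet sequence, but 0x63 cannot continue it,
# so decoding fails at offset 0.
print(ecole.hex(), utf8_problem(ecole))

# 0xd7 0xab decodes without complaint (to U+05EB); the first error is
# the stray continuation octet 0xbb at offset 5.
print(quoted.hex(), utf8_problem(quoted))
print(hex(ord(quoted[:2].decode("utf-8"))))
```

A strict decoder is used so that the failure position is visible; a
real terminal or ls would more likely substitute some placeholder and
carry on.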

This differs from the mangling performed by (say) using 8859-8 to
access files named using 8859-1; the latter will show the wrong
characters, but will preserve them.  The former will mangle them
irreversibly - that file named École, if read into a UTF-8-name-aware
editor and written back out again, isn't going to be named with the
same octet sequence.
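
A rough Python sketch of the difference follows.  Two assumptions are
baked in: 8859-5 stands in for the "wrong" 8859 charset (8859-8 leaves
some octet values unassigned, which would muddy the example), and the
UTF-8-aware program is assumed to replace undecodable octets with
U+FFFD.  That is one common policy, not the only one; a program that
keeps the undecodable octets around in some escaped form would behave
differently.

```python
name = b"\xc9cole"  # "École" as written by an 8859-1 program

# Wrong 8859 charset: the wrong character is shown for 0xc9, but the
# mapping is one octet to one character, so writing it back is lossless.
shown = name.decode("iso8859_5")
written_back = shown.encode("iso8859_5")
print(written_back == name)

# UTF-8 with replacement: 0xc9 cannot be decoded, becomes U+FFFD, and
# the original octet is gone by the time the name is written back.
lossy = name.decode("utf-8", errors="replace")
rewritten = lossy.encode("utf-8")
print(rewritten.hex())
print(rewritten == name)
```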

Of course, this is just a time-delayed version of the interoperability
problems encountered when (say) trying to pipe output from a program
that writes 8859-1 into a program expecting UTF-8, only done by saving
the "output" octet sequences as a file name for the second program to
read.

But yes, I agree that setting a UTF-8 locale should cause programs like
ls to consider UTF-8 octet streams as safe to print, just as setting an
8859-* locale should cause programs like ls to consider 8859-*
printable octet streams as safe to print.  (In the case of UTF-8 this
is more involved than it is for 8859; that's not directly relevant.)
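
As a hedged sketch of what "safe to print" could mean under each kind
of locale: the octets must first decode in the locale's charset, and
every resulting character must then be printable.  The function name
safe_to_print is hypothetical, and Python's str.isprintable() (based
on Unicode character categories) is used here as a stand-in for
whatever the C library's locale tables would say; a real ls would
consult the locale through the C library instead.

```python
def safe_to_print(name: bytes, codec: str) -> bool:
    """Decide whether a file name's octets can be written to the
    terminal as-is, given the charset implied by the locale."""
    try:
        text = name.decode(codec)
    except UnicodeDecodeError:
        # Not a well-formed octet stream in this charset at all.
        return False
    # Well-formed, but control characters and the like are still unsafe.
    return text.isprintable()


print(safe_to_print(b"\xc9cole", "iso8859_1"))   # 8859-1 name, 8859-1 locale
print(safe_to_print(b"\xc9cole", "utf-8"))       # 8859-1 name, UTF-8 locale
print(safe_to_print(b"\xc3\x89cole", "utf-8"))   # UTF-8 name, UTF-8 locale
print(safe_to_print(b"a\x1bb", "utf-8"))         # embedded ESC
```

For a single-octet charset the decode step can essentially be a table
lookup per octet; for UTF-8 the well-formedness check on multi-octet
sequences comes first, which is where the extra work lies.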

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B