tech-userlevel: Re: utf-8 and userland

Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: James K. Lowden <jklowden@schemamania.org>
List: tech-userlevel
Date: 03/12/2004 20:49:27

On Fri, 12 Mar 2004, "Wolfgang S. Rupprecht"
<wolfgang+gnus20040312T095618@dailyplanet.dontspam.wsrcc.com> wrote:
> Oops.  So much for me double checking my environment variables before
> posting.  The env setting that the uxterm wrapper does is
> "LC_CTYPE=en_US.UTF-8".  That does indeed pop out of
> nl_langinfo(CODESET) as "UTF-8".  Is uxterm doing the right thing by
> setting LC_CTYPE that way?

That's upside-down.  xterm is an application; it *receives* the locale
settings.  The mere fact that it's started with -u8 doesn't mean that
everything the client depends on is set up for UTF-8: fonts, say, or, as
Mouse pointed out, file names.

> I wonder if there is already a table of tables listing the chars that
> can safely be output for each codeset.  (Or is it sufficient to simply
> let anything that isn't a control char to pass unmolested when any
> codeset is explicitly set by the user?)

I wonder if the day isn't past that ls(1) should worry about this.  If the
filenames are UTF-8 and the fonts are UTF-8 and the xterm is UTF-8 but
LC_CTYPE is wrong, should ls refuse to copy the filenames to stdout
unmolested?  OTOH, if LC_CTYPE is UTF-8 but one of those other things is
out of whack, should it assume everything's kosher?  Clearly, LC_CTYPE
is a pretty poor window on what's what.
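
To make that concrete, here is a minimal sketch (Python for brevity,
assuming a Unix-like system where locale.nl_langinfo is available).  The
helper name codeset_for is made up for illustration.  It shows that the
codeset answer is derived from the locale name alone; nothing in it
looks at fonts, the terminal, or the bytes actually stored in a
directory.

```python
import locale


def codeset_for(lc_ctype):
    """Return what nl_langinfo(CODESET) reports once lc_ctype is selected.

    Hypothetical helper: the answer depends only on the locale name.
    Raises locale.Error if the locale is not installed on this system.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, lc_ctype)
        return locale.nl_langinfo(locale.CODESET)
    finally:
        # Put the process back the way we found it.
        locale.setlocale(locale.LC_CTYPE, saved)
```

Calling codeset_for("C") and then codeset_for("en_US.UTF-8") (if that
locale is installed) gives two different strings for the same terminal
and the same filesystem; the exact spelling of each string varies by
platform.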

I think perhaps we should distinguish between userland utilities and
full-blown applications.  Applications can reasonably demand that the
environment be fully set up before the user clicks on the button or
whatever.  But /usr/bin/* often gets used on misconfigured or unconfigured
systems (never mind, say, when some "foreign" filesystem is on /mnt for
the time being).  The we-sell-rope principle suggests ls(1) should write
the filenames to its output, and let the bytes fall where they may.  
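
A rough sketch of the two policies, again in Python and purely
illustrative (emit_raw and emit_filtered are hypothetical names, not
anything in ls(1), and the filter shown is just one conservative choice:
printable ASCII only):

```python
def emit_raw(name: bytes) -> bytes:
    """We-sell-rope policy: the filename's bytes, untouched, plus newline."""
    return name + b"\n"


def emit_filtered(name: bytes) -> bytes:
    """Conservative policy: every byte outside printable ASCII becomes '?'."""
    return bytes(b if 0x20 <= b < 0x7F else ord("?") for b in name) + b"\n"
```

With a UTF-8 name such as "naïve.txt", emit_raw hands the terminal the
same bytes the filesystem holds, while emit_filtered turns the two-byte
"ï" into "??" no matter how well the terminal could have displayed it.
To list a directory the first way, one could write emit_raw(name) to
sys.stdout.buffer for each name in os.listdir(b"."), which returns
bytes when given a bytes path.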

--jkl