Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: James K. Lowden <email@example.com>
Date: 03/12/2004 20:49:27
On Fri, 12 Mar 2004, "Wolfgang S. Rupprecht" wrote:
> Oops. So much for me double checking my environment variables before
> posting. The env setting that the uxterm wrapper does is
> "LC_CTYPE=en_US.UTF-8". That does indeed pop out of
> nl_langinfo(CODESET) as "UTF-8". Is uxterm doing the right thing by
> setting LC_CTYPE that way?
That's upside-down. xterm is an application; it *receives* the locale
settings. The mere fact that it's started with -u8 doesn't mean the
client is set up for UTF-8; other things can still be wrong: fonts,
say, or, as Mouse pointed out:
> I wonder if there is already a table of tables listing the chars that
> can safely be output for each codeset. (Or is it sufficient to simply
> let anything that isn't a control char to pass unmolested when any
> codeset is explicitly set by the user?)
I wonder whether the day when ls(1) should worry about this isn't
already past. If the filenames are UTF-8 and the fonts are UTF-8 and
the xterm is UTF-8 but LC_CTYPE is wrong, should ls refuse to copy the
filenames to stdout unmolested? OTOH, if LC_CTYPE is UTF-8 but one of
those other things is out of whack, should it assume everything's
kosher? Clearly, LC_CTYPE is a pretty poor window on what's what.
I think perhaps we should distinguish between userland utilities and
full-blown applications. Applications can reasonably demand that the
environment be fully set up before the user clicks on the button or
whatever. But /usr/bin/* often gets used on misconfigured or unconfigured
systems (never mind, say, when some "foreign" filesystem is on /mnt for
the time being). The we-sell-rope principle suggests ls(1) should write
the filenames to its output, and let the bytes fall where they may.