tech-userlevel: Re: utf-8 and userland

Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: James K. Lowden <jklowden@schemamania.org>
List: tech-userlevel
Date: 03/13/2004 17:41:20

On Fri, 12 Mar 2004, der Mouse <mouse@Rodents.Montreal.QC.CA> wrote:
> > Clearly, LC_CTYPE is a pretty poor window on what's what.
> 
> Yes, but it is the only one ls has.  

It could interrogate the terminal.  

> Setting your environment variables
> so as to lie to ls about what kind of display environment you have
> counts, to my mind, as pilot error pure and simple.  LC_CTYPE and its
> ilk exist in order to make it possible to communicate that information
> to applications, after all.

Hmm.  LC_CTYPE is very handy for telling applications what character set
to render data in.  It's no use at all for describing terminal
characteristics.  

The filesystem has no label indicating the encoding of its metadata.  It's
not an error -- as far as the filesystem is concerned -- to have different
filenames encoded differently.  
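
To make the point concrete, here is a minimal sketch (Python, not anything ls(1) does) of what a tool sees when it reads a directory: byte strings with no encoding attached. The helper names `list_raw` and `classify` are made up for illustration, and the three-way split into "ascii", "utf-8" and "other" is my own assumption about what is worth distinguishing.

```python
import os

def list_raw(path: str) -> list[bytes]:
    """Return directory entries as raw bytes, exactly as stored."""
    # Passing a bytes path makes os.listdir return bytes, with no decoding.
    return sorted(os.listdir(os.fsencode(path)))

def classify(raw: bytes) -> str:
    """Guess how one filename is encoded; the filesystem itself does not say."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "other"   # could be ISO 8859-1, EUC-JP, anything; unknowable here
    return "utf-8"
```

Run over a directory, `classify` can return a different answer for each name, and nothing at the filesystem level objects.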

You're worried about:

	$ ls

but what about:

	$ ls | xmessage -file -
or 	$ ls | hexdump -C

Do you want ls(1) to mediate according to LC_CTYPE then, too?  If so, how
in the world is an administrator expected to discover the real filename?  
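
One way the line could be drawn, sketched in Python under my own assumptions: mediate only when stdout is a terminal, and pass the real bytes through to pipes and files. The function names and the choice of '?' as the replacement are illustrative, and "printable ASCII" is a deliberately crude stand-in for whatever LC_CTYPE would declare printable.

```python
import os
import sys

def render(name: bytes, to_terminal: bool) -> bytes:
    """Return the bytes to write for one filename."""
    if not to_terminal:
        return name  # pipe or file: the consumer gets the real name
    # Terminal: replace every byte outside printable ASCII with '?'.
    return bytes(b if 0x20 <= b < 0x7F else ord("?") for b in name)

def list_directory(path: bytes = b".") -> None:
    """Write one (possibly mediated) filename per line to stdout."""
    to_terminal = sys.stdout.isatty()
    for name in sorted(os.listdir(path)):
        sys.stdout.buffer.write(render(name, to_terminal) + b"\n")
```

With this split, `ls | hexdump -C` still shows the administrator the real filename, but the bare `ls` on a terminal no longer does, which is exactly the tradeoff in question.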

Do you think it's unlikely or unreasonable for a single local disk to have
differently encoded filenames?  How far away is the day, do you suppose,
when distfiles contain UTF-8-encoded names?  Distant, perhaps, among us
English speakers, but the transition to UTF-8 will be more rapid (and
mixed) elsewhere.  

> > The we-sell-rope principle suggests ls(1) should write the filenames
> > to its output, and let the bytes fall where they may.
> 
> Yes, and I'm at least somewhat inclined to that point of view.  But I'm
> also somewhat inclined to the contrary; I don't like the idea of
> someone naming a file with an escape sequence to program a terminal's
> answerback message and then request the terminal send its answerback,
> then wait until root does an ls on it.

You're only waiting until "root does ls on it" when LC_CTYPE is mis-set,
or when the xterm wasn't started with -u8.  

I have to admit I don't understand the attack, if that's what it is.  So,
the filename has some magic sequence that the terminal reacts to, and it
spits it out on the screen.  Unless the filename is a whole program that
somehow gets xterm (presumably) to do something, the only consequence is
xterm output on the screen.  What am I missing?  

At any rate, the problem isn't limited to ls(1) in any way.  Quite a lot
of /usr/bin writes to stdout.  If we're going to protect from writing
certain stuff to the terminal, we might as well pass every
terminal-destined output through svis(3).  And establish for it a
system-wide definition, by terminal type, of "safe".  
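
As a rough illustration of what such a filter would do, here is a simplified vis-style encoder in Python. It is not the svis(3) implementation and does not reproduce its flags; the function name is made up, and the definition of "safe" used here (printable ASCII plus newline) is a placeholder assumption for the per-terminal-type definition that would really be needed.

```python
def safe_for_terminal(data: bytes) -> str:
    """Encode bytes so that only 'safe' characters reach the terminal."""
    out = []
    for b in data:
        if b == 0x5C:
            # Double the backslash so the encoding stays unambiguous.
            out.append("\\\\")
        elif 0x20 <= b < 0x7F or b == 0x0A:
            # Assumed-safe set: printable ASCII and newline.
            out.append(chr(b))
        else:
            # Everything else, ESC included, becomes a three-digit octal escape.
            out.append(f"\\{b:03o}")
    return "".join(out)
```

An escape sequence embedded in a filename then shows up as visible text such as `\033[` rather than being interpreted, at the price that every non-ASCII name is rendered as octal noise too.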

Not that we should bother, IMHO.  The Day of the GUI arriveth.  Ordinary
users don't putz with ls and friends if Nautilus is around, and anyway
they are protected by sysadmins who see to it that everything is
consistently encoded.  People relying on /usr/bin in all its glory rightly
expect that What You Get Is What You Have.   If the tools are dumb and
reliable, the user might be confused, but at least he won't be fooled.  

Interesting problem.  I don't know the answer, even if I sound like I
think I do.  :-)

--jkl