NetBSD-Users archive


UTF-8 (Re: QT4 file widget doesn't see image with %C3%AD in file name)




(Posted to netbsd-users since most of it isn't directly relevant on pkgsrc-users. Also, it would be nice to see some wider discussion about this sort of thing.)


On Wed, 22 Oct 2008, Jeremy C. Reed wrote:

I don't know if this is a NetBSD problem, a QT4 problem or an LyX problem.
(I have latest LyX ready to commit to pkgsrc.)

I haven't followed pkgsrc-users in a while, and I'm not familiar with Qt, LyX, or even much of NetBSD's handling of this kind of stuff.

But I know a little bit about text encoding!


So the problem here is that you have a filename with a non-ASCII character in it. (well, *duh*, you say..)

http://en.wikipedia.org/wiki/Image:Presidio_La_Bah%C3%ADa.jpg

That's URL-escaped UTF-8 encoded Unicode. The "i"-with-accent character can be represented in one byte in some encodings (0xED in ISO 8859-1), but here it is UTF-8 encoded as two bytes, and since those bytes are not valid ASCII they get escaped as %C3%AD in the URL.
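
To make the byte counting concrete, here is a small Python sketch (Python is just my choice for illustration, nothing in the thread uses it) that encodes the name both ways and then URL-escapes it. The variable names are made up for the example.

```python
from urllib.parse import quote, unquote_to_bytes

# The file name as Unicode text; U+00ED is "i" with acute accent.
name = "Presidio_La_Bah\u00eda.jpg"

utf8 = name.encode("utf-8")         # two bytes for the accented character
latin1 = name.encode("iso-8859-1")  # one byte for the same character

print(utf8)    # b'Presidio_La_Bah\xc3\xada.jpg'
print(latin1)  # b'Presidio_La_Bah\xeda.jpg'

# quote() encodes to UTF-8 by default and percent-escapes the non-ASCII bytes.
escaped = quote(name)
print(escaped)  # Presidio_La_Bah%C3%ADa.jpg

# Undoing the escaping gives back exactly the UTF-8 bytes.
assert unquote_to_bytes(escaped) == utf8
```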


Seems that the filename is saved UTF-8 encoded (with 0xC3 0xAD) too:

My xterm shows it as (with two question marks):

Presidio_La_Bah??a.jpg

"ls" refuses to emit those non-ASCII bytes and goes with '?' mangling instead.

(-q : "this is the default when output is to a terminal")
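
Roughly, the mangling amounts to something like the following Python sketch. This is not the actual ls code; it assumes a locale where only printable ASCII counts as printable, and the function name is invented for the example.

```python
def mangle_like_ls_q(raw: bytes) -> str:
    """Replace every byte outside printable ASCII with '?'.

    A rough imitation of "ls -q" under the assumption that only
    0x20..0x7E is considered printable.
    """
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "?" for b in raw)


utf8_name = "Presidio_La_Bah\u00eda.jpg".encode("utf-8")
latin1_name = "Presidio_La_Bah\u00eda.jpg".encode("iso-8859-1")

print(mangle_like_ls_q(utf8_name))    # Presidio_La_Bah??a.jpg
print(mangle_like_ls_q(latin1_name))  # Presidio_La_Bah?a.jpg
```

Two question marks mean two non-ASCII bytes, which is what points at UTF-8 rather than ISO 8859-1 here.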


Copying and pasting the name from Mozilla beeps and loses the character:
Presidio La Baha.jpg

And when you input that into the xterm directly, via pasting, your shell takes it as a personal insult and refuses to deal with those weird foreign bytes because they smell funny.

(bytes, or just byte. Since both [0xC3 0xAD] and [0xED] are non-ASCII you'd see the same thing.)


Is this a NetBSD problem that NetBSD should fix? A NetBSD problem that Qt
should work around? A QT problem?

I'd personally guess that the immediate problem is in Qt. Both bytes should be printable as regular characters on a system that handles ISO 8859-1 (0xC3 is "A"-with-tilde there), but the second, 0xAD, is the "soft hyphen", which seems to offer some opportunity for getting it wrong. Or it might even be recognized as UTF-8 by Qt...
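
For illustration, this is what the two bytes turn into under each interpretation. It is a Python sketch of the encodings only and says nothing about what Qt actually does internally.

```python
import unicodedata

raw = bytes([0xC3, 0xAD])  # the two bytes from the file name

# Read as ISO 8859-1: every byte is its own character.
for ch in raw.decode("iso-8859-1"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+00C3 LATIN CAPITAL LETTER A WITH TILDE
# U+00AD SOFT HYPHEN

# Read as UTF-8: the pair is one character.
ch = raw.decode("utf-8")
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+00ED LATIN SMALL LETTER I WITH ACUTE
```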


(It's funny, though. Opera (which uses Qt) collapses the %C3%AD in the URL to a single character in the address bar. Copy / paste from the page works just fine; both to a UTF-8-using mlterm and an ISO 8859-using wterm (rxvt), which is kind of surprising if you think about it.)


But apart from that, there's still ls, the terminal emulator and the shell. You can use a good terminal emulator (x11/mlterm is nice IMO), and use "ls -w" (apparently, I just looked it up) and the filename should show up correctly in a listing.

If you can figure out how to tease your shell out of the 1970s, cut and paste should work just fine too. It is not hard, but it probably differs a lot from shell to shell.


Any suggestions?

Round up everyone who still thinks 7-bit ASCII is a good idea and deport them to Venus.


Magnus


