Re: wide characters and i18n
> 1. there's a fundamentally nasty problem that UNIX itself has never
> dealt with in a general way: what's in that file? Text? MPEG
> streams? JPEG?
I would disagree that UNIX has never dealt with that problem. I would
say that UNIX _has_ dealt with it, by explicitly pushing it off to the
application layer, and that, indeed, that is where much of its power
and flexibility comes from. I regularly do useful things by treating
data as if it were of a type it originally wasn't intended to be.
This approach has problems, of course - perhaps most notably at the
moment, the conflict between octet strings and character strings - but,
well, try to find an approach to anything that doesn't have problems.
> The answer has typically been, "well, most of our software handles
> text (meaning 7-bit ASCII), [...]" [...]
> Extending that to ISO-8859-1 ("ISO Latin 1") was easy ("make
> everything 8-bit clean! No, that's not a parity bit any more!")
There's a critical point here: doing that did _not_ just extend that to
Latin-1: it extended it to whatever charset and encoding your input and
display devices felt like using. I can work (and have worked) with
8859-7 text simply by starting a new terminal emulator with an 8859-7
font. (Input is a little more awkward than 8859-1 input, but that's
because I've put more effort into 8859-1 input than 8859-7 input, not
because there's anything fundamentally more difficult about -7.) Lots
of Linux users use UTF-8 regularly, because they have input and output
setups that make UTF-8 easy compared to other encodings and charsets.
(Not that there's anything fundamentally different about Linux in this
regard; its developers have just put the time into making their
software support UTF-8. Other systems may have done the same; Linux is
just the one I'm aware of because I've run into it personally.)
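As a rough illustration of the point above, here is a minimal Python sketch. The language and the three sample octets are chosen only for illustration and are not from the original setup. It shows that 8-bit-clean software merely carries octets along, and the charset is decided at the display end:

```python
# The same three octets, passed through untouched by 8-bit-clean software.
octets = bytes([0xE1, 0xE2, 0xE3])

# What a terminal using an ISO-8859-1 font would show.
as_latin1 = octets.decode("iso8859-1")

# What a terminal using an ISO-8859-7 (Greek) font would show.
as_greek = octets.decode("iso8859-7")

print(as_latin1)  # áâã
print(as_greek)   # αβγ
```

Nothing in the octets themselves says which reading is the intended one; that knowledge lives entirely in the input and display setup.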
> Theoretically, the POSIX locale stuff is supposed to handle things
> beyond that, but it's a more complicated and subtle problem than
> those POSIX committees really thought about. Just setting LANG
> environment variable (and its associates) to where ever you are or
> whatever you speak/read really tells the system and software exactly
> nothing about the content of the files you are manipulating - LANG
> speaks more of I/O to you, i.e. what you're prepared to read on a
> display, and what sorts of characters you'll be inputting from your
> ... input devices.
Right. It does its job: it replaces the old "text is ASCII" assumption
with a variable "text is $LANG" assumption (I'm deliberately glossing
over many details here, but that's what it amounts to from the point of
view of this discussion).
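The "text is $LANG" assumption can be sketched as follows. This is a simplified illustration, not how any particular libc does it: the helper name `codeset_from_locale` and the "ASCII" fallback are assumptions made for the example, and real locale lookup also consults LC_ALL and LC_CTYPE before LANG.

```python
import os

def codeset_from_locale(name, default="ASCII"):
    """Return the codeset part of a POSIX-style locale name.

    Names look like language[_territory][.codeset][@modifier].
    When no codeset is given, fall back to `default` (an assumption
    of this sketch; real systems choose their own default).
    """
    name = name.split("@", 1)[0]      # drop any @modifier
    if "." in name:
        return name.split(".", 1)[1]  # keep what follows the dot
    return default

# Simplified: only LANG is consulted here.
lang = os.environ.get("LANG", "C")
print("text is assumed to be", codeset_from_locale(lang))
```

Note that the result describes the user's terminal and input setup; it says nothing about what any given file actually contains.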
> I commend this well written paper to your attention:
Well-written, perhaps. But I'm not convinced their choices are
particularly good ones.
In particular, if you want to use Unicode, I think you should stop
trying to use octets for character strings in any form: I think char
should be a 16-bit type, basically. (It's not quite that simple,
mostly because of all the octet streams that you'll still want to
handle, and because C defines char as the smallest integral type, so
widening it leaves nothing to hold an octet.) A bit like Plan 9's
Rune, but without the UTF-8 form.
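The difference between octet-based and fixed-width character strings can be sketched in Python, used here only as a stand-in for the C types under discussion. The sample text is an assumption chosen to fit in 16 bits; code points above U+FFFF would not fit in a single 16-bit unit and are outside this sketch.

```python
import array

text = "αβγ"  # sample text, assumed to lie within the 16-bit range

# Variable-width octet form: each of these characters takes two octets,
# so an octet index is not a character index.
utf8 = text.encode("utf-8")

# Fixed-width form: one unit per character ("H" = unsigned short,
# 16 bits on common platforms).
units = array.array("H", (ord(c) for c in text))

print(len(utf8))      # 6 octets
print(len(units))     # 3 units
print(chr(units[1]))  # β, found by direct indexing
```

With the fixed-width form the n-th character is simply the n-th element, which is the property a wider char type would give string-handling code.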
> 2. With regard to file contents, there are three approaches:
> guessing, assuming a default, or explicit meta-data [...].
Actually, assuming a default is just a special case of guessing.
For that matter, so is explicit meta-data; it amounts to guessing that
the labeling is accurate. I regularly see mislabeled data, perhaps
most commonly email labeled as ISO-8859-1 but containing octets in the
0x80-0x9f range, which are not 8859-1 text. This substantially impairs
my confidence that a metadata-based scheme will have accurate metadata.
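A check for this kind of mislabeling can be sketched in a few lines of Python. The function name `c1_octets` and the message body are hypothetical, and the guess that such octets are really Windows-1252 is an assumption: it is a common cause, not something the data itself proves.

```python
def c1_octets(data):
    """Return the offsets of octets in the 0x80-0x9f range."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

# Hypothetical message body that arrived labeled as ISO-8859-1.
body = b"He said \x93hello\x94 to me"

suspect = c1_octets(body)
print(suspect)  # [8, 14]

if suspect:
    # The label is doubtful; Windows-1252 is one plausible guess.
    print(body.decode("cp1252"))
```

Either way the consumer ends up guessing: the label only moves the guess from "what is this data?" to "is this label right?".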
I went through my larval phase under VMS, which uses a fairly elaborate
metadata scheme. For all the benefits it had, I still found myself
regularly using CONVERT/FDL to rewrite the metadata attached to file
contents so I could use tools that didn't understand what the original
> [...] ... but then I don't use X11 (too many years of using the much
> more thoroughly integrated MacOS environment has spoiled me - every
> time I try to use X, I have to work hard to suppress the strong
> desire to do violence to the people responsible for it).
How odd. Every time I have occasion to subject myself to a Mac UI, I
find myself with related feelings. I don't know whether it's just a
question of what we're used to or whether there's something different
between us that makes us better matches to different UI styles.
Also, you may be confusing X with some common window system built on X.
It would be entirely possible to build a UI as thoroughly integrated as
the Mac one is atop X. (I don't know why it hasn't been done, or why
it hasn't gained wide popularity if it has.) X is not a window system,
despite being named as one; it's really a framework for building window
systems.
/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML               mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B