tech-userlevel archive


Re: wide characters and i18n



> 1. there's a fundamentally nasty problem that UNIX itself has never
> dealt with in a general way: what's in that file?  Text?  MPEG
> streams?  JPEG?

I would disagree that UNIX has never dealt with that problem.  I would
say that UNIX _has_ dealt with it, by explicitly pushing it off to the
application layer, and that, indeed, that is where much of its power
and flexibility comes from.  I regularly do useful things by treating
data as if it were of a type it originally wasn't intended to be.

This approach has problems, of course - perhaps most notably at the
moment, the conflict between octet strings and character strings - but,
well, try to find an approach to anything that doesn't have problems.

> The answer has typically been, "well, most of our software handles
> text (meaning 7-bit ASCII), [...]" [...]

> Extending that to ISO-8859-1 ("ISO Latin 1") was easy ("make
> everything 8-bit clean! No, that's not a parity bit any more!")

There's a critical point here: doing that did _not_ just extend that to
Latin-1: it extended it to whatever charset and encoding your input and
display devices felt like using.  I can work (and have worked) with
8859-7 text simply by starting a new terminal emulator with an 8859-7
font.  (Input is a little more awkward than 8859-1 input, but that's
because I've put more effort into 8859-1 input than 8859-7 input, not
because there's anything fundamentally more difficult about -7.)  There
are lots of Linux users who use UTF-8 regularly, because they have
input and output setups that make UTF-8 easy compared to other
encodings and charsets.  (Not that there's anything fundamentally
different about Linux in this regard; they've just put the time into
making their software support UTF-8.  There may be others; Linux is
just the one I'm aware of because I've run into it personally.)

> Theoretically, the POSIX locale stuff is supposed to handle things
> beyond that, but it's a more complicated and subtle problem than
> those POSIX committees really thought about.  Just setting LANG
> environment variable (and its associates) to where ever you are or
> whatever you speak/read really tells the system and software exactly
> nothing about the content of the files you are manipulating - LANG
> speaks more of I/O to you, i.e. what you're prepared to read on a
> display, and what sorts of characters you'll be inputting from your
> ... input devices.

Right.  It does its job: it replaces the old "text is ASCII" assumption
with a variable "text is $LANG" assumption (I'm deliberately glossing
over many details here, but that's what it amounts to from the point of
view of this discussion).

> I commend this well written paper to your attention:

> http://plan9.bell-labs.com/sys/doc/utf.html

Well-written, perhaps.  But I'm not convinced their choices are
particularly good ones.

In particular, if you want to use Unicode, I think you should stop
trying to use octets for character strings in any form: I think char
should be a 16-bit type, basically.  (It's not quite that simple,
mostly because of all the octet streams that you'll want to handle, and
by definition char is the smallest integral type.)  A bit like Plan 9's
Rune, but without the UTF-8 form.

> 2. With regard to file contents, there are three approaches:
> guessing, assuming a default, or explicit meta-data [...].

Actually, assuming a default is just a special case of guessing.

For that matter, so is explicit meta-data; it amounts to guessing that
the labeling is accurate.  I regularly see mislabeled data, perhaps
most commonly email labeled as ISO-8859-1 but containing octets in the
0x80-0x9f range, which are not 8859-1 text.  This substantially impairs
my confidence that a metadata-based scheme will have accurate metadata.

I went through my larval phase under VMS, which uses a fairly elaborate
metadata scheme.  For all the benefits it had, I still found myself
regularly using CONVERT/FDL to rewrite the metadata attached to file
contents so I could use tools that didn't understand what the original
metadata specified.

> [...] ... but then I don't use X11 (too many years of using the much
> more thoroughly integrated MacOS environment has spoiled me - every
> time I try to use X, I have to work hard to suppress the strong
> desire to do violence to the people responsible for it).

How odd.  Every time I have occasion to subject myself to a Mac UI, I
find myself with related feelings.  I don't know whether it's just a
question of what we're used to or whether there's something different
between us that makes us better matches to different UI styles.

Also, you may be confusing X with some common window system built on X.
It would be entirely possible to build a UI as thoroughly integrated as
the Mac one is atop X.  (I don't know why it hasn't been done, or why
it hasn't gained wide popularity if it has.)  X is not a window system,
despite being named as one; it's really a framework for building window
systems.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

