Re: pkg_alternatives, Darwin and locale

To: pkgsrc-users%netbsd.org@localhost
Subject: Re: pkg_alternatives, Darwin and locale
From: Christian Biere <christianbiere%gmx.de@localhost>
Date: Tue, 22 Jul 2008 01:29:26 +0200

Louis Guillaume wrote:
> Christian Biere wrote:
>> What is Apple supposed to fix? History?
> Well, try as they may, I fear they cannot do that :)

I thought Mac OS X comes with a time machine these days.

> But maybe they can fix their system so that tr and sed work in en_US  
> locales....

Many UNIX text-processing tools are inherently broken for hysterical
raisins. Once upon a time, the world consisted only of 8 bits and then
there was the multi-byte.

> $ export LC_ALL=en_US

As you're not specifying any character encoding, you leave it up to the
system to decide this. It's reasonable to default to UTF-8, as it covers
every character set. This is even more reasonable on Mac OS X, which
enforces UTF-8 for filenames.

For example, I'm explicitly using "en_US.UTF-8". On NetBSD the other
two options for "en_US" are "en_US.ISO8859-1" and the nEUrotic
"en_US.ISO8859-15".

> $ printf "\254\n" | tr -d '\254'
> tr: Illegal byte sequence

Most likely the locale's character encoding is UTF-8, because \254 on
its own isn't a valid UTF-8 byte sequence.
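
To make that concrete, here is a small Python sketch. Python is only a
convenient way to poke at the byte here; it is not a claim about how tr
is implemented, and the helper name decodes_as is made up for this
example. Octal 254 is the byte 0xAC, whose top bits are 10, which UTF-8
reserves for continuation bytes that may only follow a lead byte.

```python
# Octal \254 is the single byte 0xAC.
raw = bytes([0o254])

# UTF-8 continuation bytes have the bit pattern 10xxxxxx. They can only
# appear after a lead byte, never at the start of a character.
is_continuation = (raw[0] & 0b11000000) == 0b10000000


def decodes_as(data, encoding):
    """Hypothetical helper: True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(is_continuation)                      # lone continuation byte?
print(decodes_as(raw, "utf-8"))             # valid as UTF-8?
print(ascii(raw.decode("iso-8859-1")))      # same byte read as ISO8859-1
```

Read as ISO8859-1, the same byte is simply U+00AC, the "¬" character
that shows up in the sed output quoted below.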

> $ printf "\254\n" | sed 's/.//g'
> ¬

Note that sed, tr and many other UNIX utilities operate on 'text', not
'bytes', but only a few of them can really handle multi-byte characters.
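
The difference between 'text' and 'bytes' can be sketched in a few lines
of Python. This is only an illustration with an arbitrary sample word,
not a description of what sed or tr do internally.

```python
# "naive" with a diaeresis on the i: five characters of text.
text = "na\u00efve"
encoded = text.encode("utf-8")

chars = len(text)        # counted as text
octets = len(encoded)    # counted as bytes; U+00EF takes two bytes in UTF-8

# A tool that cuts at byte offsets can split a character in half.
broken = encoded[:3]
try:
    broken.decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False

print(chars, octets, split_is_valid)
```

A utility that counts or cuts in bytes while claiming to work on
characters gives the right answer only as long as every character
happens to be a single byte.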

> By the way, this is only a problem in Leopard, from what I can tell. I  
> tried on Tiger and everything works "normally".

I think it's actually better if something simply doesn't work by
design rather than giving the impression there's no problem and then
silently failing. If these implementations on other systems handle
multi-byte locale encodings perfectly fine, that's great. If it 'just
works' in some sub-cases but fails horribly in others, that's just bad.

Also note that the problem starts even before these tools come into
play, namely when the shell parses the script. If the shell only accepts
text that is encoded in accordance with the current locale's encoding,
it may abort before invoking the tools with raw binary arguments. The
semantics of these characters may differ significantly depending on
the character encoding.

Last but not least, it's not just the oh-so-evil-who-needs-this-anyway
UTF-8 which is subject to multi-byte issues. It also applies to several
Asian character encodings and some are especially evil as they are not
straight supersets of ASCII with respect to the first 128 characters.
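
One closely related pitfall, shown here with Shift_JIS as an assumed
example (the encoding and the sample character are my choice for
illustration): the second byte of a two-byte character can fall into the
ASCII range, so a byte-oriented tool sees an ASCII character that isn't
really there.

```python
# U+8868 is a common kanji; in Shift_JIS it takes two bytes.
kanji = "\u8868"
encoded = kanji.encode("shift_jis")

# The trailing byte has the same value as an ASCII backslash (0x5C).
trail_is_backslash = encoded[1] == 0x5C

# A naive byte-wise search for a backslash therefore "finds" one,
# although the text contains no backslash at all.
false_hit = b"\\" in encoded
really_has_backslash = "\\" in kanji

print(len(encoded), trail_is_backslash, false_hit, really_has_backslash)
```

UTF-8 at least avoids this particular trap, since every byte of a
multi-byte sequence has the high bit set.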

-- 
Christian
