pkgsrc-Users archive

Re: pkg_alternatives, Darwin and locale

Louis Guillaume wrote:
> Christian Biere wrote:
>> What is Apple supposed to fix? History?
> Well, try as they may, I fear they cannot do that :)

I thought Mac OS X comes with a time machine these days.

> But maybe they can fix their system so that tr and sed work in en_US  
> locales....

Many UNIX text-processing tools are inherently broken for hysterical raisins.
Once upon a time, the world consisted only of 8 bits and then there was
the multi-byte.

> $ export LC_ALL=en_US

As you're not specifying any character encoding, you leave it up to the
system to decide this. It's reasonable to default to UTF-8 as it can
encode every character set. This is even more reasonable on Mac OS X,
which enforces UTF-8 for filenames.

For example, I'm explicitly using "en_US.UTF-8". On NetBSD the other
two options for "en_US" are "en_US.ISO8859-1" and the nEUrotic

> $ printf "\254\n" | tr -d '\254'
> tr: Illegal byte sequence

Most likely the locale's character encoding is UTF-8, because \254 on
its own isn't a valid UTF-8 byte sequence.
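
To see why the byte is rejected, here is a minimal sketch, assuming a
POSIX shell with iconv(1) and od(1) available. The helper name
is_valid_utf8 is made up for illustration; it just uses iconv as a
validator. It only shows the encoding side, not what tr does internally.

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Octal 254 is the byte 0xAC. On its own it is a UTF-8 continuation
# byte without a lead byte, so validation fails.
if printf '\254' | is_valid_utf8; then
    echo "0xAC alone: valid UTF-8"
else
    echo "0xAC alone: not valid UTF-8"
fi

# Read as ISO-8859-1, the same byte is U+00AC (the "not" sign), which
# UTF-8 encodes as the two bytes c2 ac.
printf '\254' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

So the same byte is a perfectly fine character in "en_US.ISO8859-1" and
garbage in "en_US.UTF-8".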

> $ printf "\254\n" | sed 's/.//g'
> ¬

Note that sed, tr and many other UNIX utilities operate on 'text', not
'bytes', but only a few of them can really handle multi-byte characters.
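
If a script really means 'bytes', a common workaround is to run the tool
in the C locale, where every byte counts as one character. A minimal
sketch, assuming a POSIX shell; the helper name strip_byte_254 is made up
for illustration and this is not necessarily how pkg_alternatives should
be fixed:

```shell
# Hypothetical helper: delete the byte 0xAC (octal 254) from stdin.
# LC_ALL=C applies only to this one tr invocation and makes it treat
# its input and its arguments as raw bytes.
strip_byte_254() {
    LC_ALL=C tr -d '\254'
}

# Dump the result as hex so the terminal's encoding doesn't matter.
printf 'a\254b\n' | strip_byte_254 | od -An -tx1
```

Setting the locale per command keeps the rest of the script, and
whatever it prints for the user, in the user's own locale.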

> By the way, this is only a problem in Leopard, from what I can tell. I  
> tried on Tiger and everything works "normally".

I think it's actually better if something simply doesn't work by
design rather than giving the impression there's no problem and then
silently failing. If these implementations on other systems handle
multi-byte locale encodings perfectly fine, that's great. If it 'just
works' in some sub-cases but fails horribly in others, that's just bad.

Also note that the problem starts even before any of these tools come
into play, namely when the shell parses the script. If the shell only
accepts text that is encoded in accordance with the current locale's
encoding, it may abort before invoking the tools with raw binary
arguments. The semantics of these bytes may differ significantly
depending on the character encoding.

Last but not least, it's not just the oh-so-evil-who-needs-this-anyway
UTF-8 which is subject to multi-byte issues. It also applies to several
Asian character encodings, and some are especially evil as they are not
straight supersets of ASCII with respect to the first 128 characters.
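
A related hazard, sketched here with Shift_JIS as my own pick of example
and assuming your iconv(1) knows the name SHIFT_JIS: bytes from the ASCII
range can show up inside a multi-byte character, so a byte-wise tool can
cut a character in half. The helper name sjis_so is made up for
illustration.

```shell
# Hypothetical helper: emit katakana "so" (U+30BD) encoded as Shift_JIS.
# The input is given as UTF-8 octal escapes (e3 82 bd) so that this
# snippet itself stays pure ASCII.
sjis_so() {
    printf '\343\202\275' | iconv -f UTF-8 -t SHIFT_JIS
}

# The character is the two bytes 83 5c; 0x5c is the ASCII backslash.
sjis_so | od -An -tx1

# A byte-wise "delete all backslashes" strips the second byte and
# leaves a dangling lead byte behind.
sjis_so | LC_ALL=C tr -d '\\' | od -An -tx1
```

So forcing byte semantics is only safe when you know the data's encoding
keeps ASCII bytes for ASCII characters alone, as UTF-8 does.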

