Re: [PATCH] replace 0xA0 to whitespace in plain text files (part 2)
>>>> But most new versions of the well-known tools are becoming UTF-8
>>>> (wide char internally) compatible. Thus less, wc, etc. complain
>>>> about such symbols, which look like Unicode sequence
>>>> starters.
I trust there will be a way to shut this off? I ran into a real
headache on a Linux system which I eventually tracked down to its wc
being UTF-8 by default and exiting (silently!) as soon as it ran into
an invalid UTF-8 sequence. This broke rather severely when I, coming
from a traditional Unix background, used those tools to manipulate
bytes rather than characters.
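
To illustrate the byte-versus-character distinction described above, here is a minimal Python sketch. It is not how any real wc is implemented; the function names and the sample data are hypothetical. It only shows that a lone 0xA0 octet can always be counted as a byte, decodes cleanly as Latin-1, and is rejected by a strict UTF-8 decoder.

```python
# Hypothetical sample: plain text containing a single 0xA0 octet
# (NO-BREAK SPACE in Latin-1, but not a valid UTF-8 sequence by itself).
data = b"foo\xa0bar\n"


def count_bytes(buf: bytes) -> int:
    """Traditional behaviour: every octet counts, no decoding involved."""
    return len(buf)


def count_chars_latin1(buf: bytes) -> int:
    """Latin-1 maps each octet to exactly one character."""
    return len(buf.decode("latin-1"))


def count_chars_utf8(buf: bytes) -> int:
    """Strict UTF-8: raises UnicodeDecodeError on an invalid sequence."""
    return len(buf.decode("utf-8"))


if __name__ == "__main__":
    print("bytes:  ", count_bytes(data))
    print("latin-1:", count_chars_latin1(data))
    try:
        print("utf-8:  ", count_chars_utf8(data))
    except UnicodeDecodeError as exc:
        # Unlike the silent exit described above, this failure is reported.
        print("utf-8:   invalid input:", exc.reason)
```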
If they complain, that at least will alert people to the problem. But
if they don't have any easy way to go back to the traditional
behaviour, I'll have to replace them - or, more likely, just not
"upgrade". I do not want UTF-8; if I want to use Unicode, it seems
much saner to me to use streams of hexdecets rather than encoding
hexdecets into octet streams with a funky variable-length encoding.
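
As a small illustration of the "variable-length encoding" point, the following Python sketch compares how many octets UTF-8 needs with how many 16-bit units UTF-16 needs for a few characters. The helper name is hypothetical, and the sample characters are deliberately limited to code points below U+10000.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 octet count, UTF-16 16-bit unit count) for one character."""
    utf8_octets = len(ch.encode("utf-8"))
    # "utf-16-be" emits no byte-order mark, so two octets make one 16-bit unit.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return utf8_octets, utf16_units


if __name__ == "__main__":
    # Sample characters chosen for illustration: 'A', U+00A0, U+20AC.
    for ch in ("A", "\u00a0", "\u20ac"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```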
>>> I think you should only complain about files that are not valid
>> Not that I care so much, but is NetBSD supposed to have its files
>> in Latin1? Is that supposed to be the source character set, or
> I think that simply is the practical reality.
I think the default should be Latin-1, except that I also think tools
such as wc should, by default, not complain about invalid Latin-1,
instead sticking with the traditional behaviour of operating on bytes
rather than characters.
This is not to say that it should be impossible - or even difficult -
to make them use UTF-8 (or Latin-1 with errors for invalid octets).
Just that it shouldn't be the default.
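
The default being argued for can be sketched roughly as follows, in Python. This is an assumption-laden illustration, not a patch to any real tool: `count_units` and its `encoding` parameter are hypothetical names. With no encoding requested it counts octets and cannot fail; character counting, with errors on invalid input, is opt-in.

```python
from typing import Optional


def count_units(buf: bytes, encoding: Optional[str] = None) -> int:
    """Count octets by default; count characters only when asked to.

    encoding=None   -> traditional behaviour, operates on bytes
    encoding="..."  -> strict decode, raises UnicodeDecodeError on bad input
    """
    if encoding is None:
        return len(buf)
    return len(buf.decode(encoding))


if __name__ == "__main__":
    sample = b"a\xa0b"  # hypothetical input with a lone 0xA0 octet
    print(count_units(sample))                      # byte count, never fails
    print(count_units(sample, encoding="latin-1"))  # opt-in character count
```

Note that Python's "latin-1" codec accepts every octet value, so a variant that rejects some octets as invalid Latin-1 would need an extra check that this sketch does not include.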
/~\ The ASCII                             der Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B