Subject: Improved character handling (was: RE: wc: filename: invalid byte
To: Thomas Klausner <>
From: De Zeurkous <>
List: tech-userlevel
Date: 08/26/2007 11:23:36

On Sun, August 26, 2007 09:39, Thomas Klausner wrote:
> On Sun, Aug 26, 2007 at 09:36:13AM +0100, Iain Hibbert wrote:
>> I see that "The behaviour of mbrtowc() is affected by the LC_CTYPE
>> category of the current locale" .. do you have any locale settings?
> Actually, yes:

Arg. Inherent parsing difference between bytes and characters. This
distinction breaks pipelining completely. In fact, this kind of el weirdo
stuff is exactly why I oppose the half-ass backward compatibility of
UTF-8. Recommend switching to 16-bit bytes throughout the system; this
allows us 8 bits for flags, of which I can see immediate use for seven:

10000000 -> Interpret as raw bytes
01000000 -> Interpret as ASCII
00100000 -> Interpret as UTF-8 (heck, a lot of software seems to have
                                adopted it...)
00010000 -> Interpret as UTF-32
00000100 -> Use extended format (for future expansion; software should bomb
                                 out on it for now)
00000010 -> Invalid (/very/ primitive form of error checking, but not
00000001 -> Ignore (more a matter of structural hygiene than anything else)

I've put the 'Use extended format' bit (and its recommended handling) in
there to prevent us from repeating past mistakes. If anyone can find a
reasonable use for the remaining bit (00001000) apart from 'Reserved',
they can have it :^)
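
To make the layout concrete, here is a rough sketch in C of how such a
16-bit byte could be packed and checked. The flag values are the ones
listed above; everything else is an assumption made for illustration:
the type name widebyte, the wb_* helpers, putting the flags in the high
8 bits, and skipping 'Invalid' units in the same way as 'Ignore' ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits, copied from the list above. */
enum {
    WB_RAW      = 0x80, /* 10000000: interpret as raw bytes */
    WB_ASCII    = 0x40, /* 01000000: interpret as ASCII */
    WB_UTF8     = 0x20, /* 00100000: interpret as UTF-8 */
    WB_UTF32    = 0x10, /* 00010000: interpret as UTF-32 */
    WB_RESERVED = 0x08, /* 00001000: the unassigned bit */
    WB_EXTENDED = 0x04, /* 00000100: use extended format */
    WB_INVALID  = 0x02, /* 00000010: invalid */
    WB_IGNORE   = 0x01  /* 00000001: ignore */
};

/* Hypothetical 16-bit byte. ASSUMPTION: flags live in the high 8 bits,
 * the 8-bit payload in the low 8 bits. */
typedef uint16_t widebyte;

static widebyte wb_make(uint8_t flags, uint8_t payload)
{
    return (widebyte)(((unsigned)flags << 8) | payload);
}

static uint8_t wb_flags(widebyte w)   { return (uint8_t)(w >> 8); }
static uint8_t wb_payload(widebyte w) { return (uint8_t)(w & 0xFF); }

/*
 * Count the units a wc-like tool would see.
 * Returns -1 as soon as an 'extended format' unit shows up, which is
 * the "bomb out on it for now" handling. ASSUMPTION: units flagged
 * 'Ignore' or 'Invalid' are skipped rather than counted.
 */
static long wb_count(const widebyte *buf, size_t n)
{
    long count = 0;

    for (size_t i = 0; i < n; i++) {
        uint8_t f = wb_flags(buf[i]);

        if (f & WB_EXTENDED)
            return -1;
        if (f & (WB_IGNORE | WB_INVALID))
            continue;
        count++;
    }
    return count;
}
```

With this, wb_make(WB_ASCII, 'a') yields a unit whose flags say "ASCII"
and whose payload is 'a', and a consumer never has to guess at the
encoding from the payload bytes themselves.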

Recommend hacking an extra file type into the file system to make up for
the difference for the foreseeable future. Since we already have file types
(as in regular files, directories, etc.) in the file system this is a much
less ugly hack than UTF-8 and ensures correct handling of legacy (i.e.,
ASCII, UTF-8) data.


De Zeurkous

Friggin' Machines!