tech-userlevel archive


Re: UTF-8 capable fmt(1)



Hi all,

I investigated the fmt code to find out where it breaks for me.
I found this thread when I checked whether my problem had already
been discussed.

On Sat, Jan 09, 2016 at 02:44:09AM +0100, Thomas Klausner wrote:
> On Fri, Jan 08, 2016 at 04:21:13PM -0800, Tom Spindler (moof) wrote:
[...]
> > FWIW, I'm fine with replacing fmt with a newer version, but I'd like to
> > have a better idea of what it fixes.
> 
> For me, it randomly breaks non-ASCII characters. I haven't really
> understood what it does; I think it strips out parts of the code
> points if it doesn't understand them.

Here is where fmt breaks for me: it tries to skip over non-printable
characters using this test:

        if(!(isprint(c) || c == '\t' || c >= 160)) {
                c = getc(fi);
                continue;
        }


Now, ß, ÄÖÜ and some Greek letters (let me randomly insert
ασδφ here) are encoded in UTF-8 as two bytes 0xCY 0xZZ with
0x80 <= 0xZZ < 0xa0, so the second byte is skipped and lost; the
leftover 0xCY then combines with some innocent following byte to
produce something unspeakable.
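
The effect of that test on individual bytes can be checked in
isolation. This is only a minimal sketch, assuming the default C
locale (where isprint() is false for every byte above 127);
kept_by_old_test and filter_old are names made up for this
illustration, not functions from fmt:

```c
#include <ctype.h>
#include <stddef.h>

/* The test quoted above, inverted: nonzero if byte c would be kept. */
static int
kept_by_old_test(int c)
{
	return isprint(c) || c == '\t' || c >= 160;
}

/*
 * Hypothetical helper: copy the bytes of in[0..n) that survive the
 * test into out, which must have room for n bytes.  Returns the
 * number of bytes written.  Meant for a single line without its
 * trailing newline.
 */
static size_t
filter_old(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i, j = 0;

	for (i = 0; i < n; i++)
		if (kept_by_old_test(in[i]))
			out[j++] = in[i];
	return j;
}
```

Fed ß (0xC3 0x9F), this leaves the lone byte 0xC3 behind, while
α (0xCE 0xB1) passes through intact because 0xB1 >= 160.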

Most of my needs are met by a version with c >= 128 in the test
above, maybe made conditional on strcmp(getenv("LC_CTYPE"), "utf-8").
This is a horrible hack and overestimates the screen space needed,
since every byte of a multibyte character counts as a column, but
that's good enough for me now.
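
To make the hack concrete, here is a rough sketch of how the
relaxed test could look. It is not a patch against the fmt
source: keep_byte and names_utf8 are hypothetical names, and the
locale check is a simplification that only looks at the codeset
suffix of a name like "de_DE.UTF-8" (no "@modifier" handling). It
also guards against getenv() returning NULL.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strcasecmp() (POSIX) */

/*
 * Hypothetical helper: nonzero if a locale name such as
 * "de_DE.UTF-8" ends in a UTF-8 codeset.  NULL is accepted so the
 * result of getenv() can be passed in directly.
 */
static int
names_utf8(const char *name)
{
	const char *dot;

	if (name == NULL || (dot = strrchr(name, '.')) == NULL)
		return 0;
	dot++;
	return strcasecmp(dot, "UTF-8") == 0 ||
	    strcasecmp(dot, "UTF8") == 0;
}

/*
 * Relaxed version of the test: with utf8 set, every byte >= 128 is
 * kept, so lead and continuation bytes both survive.  Otherwise the
 * old limit of 160 applies.
 */
static int
keep_byte(int c, int utf8)
{
	return isprint(c) || c == '\t' || c >= (utf8 ? 128 : 160);
}
```

A caller would compute the flag once, for example with
names_utf8(getenv("LC_CTYPE")) (which needs <stdlib.h>), and then
skip a byte only when keep_byte(c, utf8) is zero.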

Regards,
	-is

