tech-userlevel archive


Re: UTF-8 capable fmt(1)



>> Define "breaks"? And what "wide character support" means? (Specifically,
>> if it tries to figure out the width of unicode glyphs or sequences
>> - e.g. "What's the column usage of 'KATAKANA LETTER HE' + 'COMBINING
>> DIAERESIS' + 'LATIN SMALL LETTER X'? And where, if anywhere, should it
>> insert a break in that sequence?")
>> 
>> FWIW, I'm fine with replacing fmt with a newer version, but I'd like to
>> have a better idea of what it fixes.
>
>For me, it randomly breaks non-ASCII characters. I haven't really
>understood what it does; I think it strips out parts of the code
>points if it doesn't understand them.

I ran into similar issues with 'par'; when I dug into it, here's what I
found:

- par (and I suspect fmt) use things like isspace() to determine if a
  particular BYTE is a space or not.
- If you are using a UTF-8 locale, you might end up with a byte sequence
  like C3 A0 (the encoding of U+00E0).
- If you call isspace() on those bytes individually, it might say they
  are spaces.  A0 is interpreted as "no-break space" (because, taken as
  a single byte in an 8-bit character set, it maps to U+00A0).  That
  causes the bytes to get chewed up as they either get split or
  substituted with a real space.  Other bytes ... depends on the system.
- It's unclear what is supposed to happen if you call isspace() with
  values greater than 127; POSIX says isspace() takes "characters".  Are
  those Unicode codepoints?  Unspecified.
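
To make the failure concrete, here is a minimal sketch of the byte-wise
scan described above.  bytewise_isspace() and count_space_bytes() are
hypothetical names, not code from par or fmt, and bytewise_isspace() only
simulates an isspace() whose 8-bit tables classify A0 as a space; what the
real isspace() returns for that byte depends on the system and locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for isspace() on a system whose 8-bit character
 * tables treat byte 0xA0 as NO-BREAK SPACE (as ISO 8859-1 does).
 * ASCII bytes are passed through to the real isspace(). */
int bytewise_isspace(unsigned char b)
{
    if (b == 0xA0)
        return 1;
    return b < 0x80 && isspace(b) != 0;
}

/* Count the bytes that a byte-oriented formatter would regard as
 * whitespace, and therefore as places where it may split or
 * substitute. */
size_t count_space_bytes(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (bytewise_isspace((unsigned char)*s))
            n++;
    }
    return n;
}
```

With this classifier, count_space_bytes("\xC3\xA0") returns 1 even though
the string is the single letter U+00E0 and contains no whitespace at all;
a formatter that breaks or rewrites at that byte leaves an invalid UTF-8
sequence behind.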

In my experience, you can just use the POSIX functions for MOST
applications.  What you normally care about are "how many bytes make
up a character", "How wide is this character", and "is this character
a space/number/letter/etc".  You get all of that.  I have looked at
ICU before, but it is pretty big and does a lot of things that aren't
strictly necessary for things like fmt.
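
As a rough illustration of those three questions, the sketch below walks a
multibyte string using mbrtowc() (bytes per character), wcwidth() (column
width) and iswspace() (classification).  struct text_stats and scan_text()
are names I made up for the example, and the caller is assumed to have
already selected a UTF-8 locale with setlocale(); this is not how fmt or
par are actually written.

```c
#define _XOPEN_SOURCE 700   /* needed on some systems to declare wcwidth() */
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical summary of one pass over a multibyte string. */
struct text_stats {
    size_t chars;    /* characters successfully decoded */
    size_t spaces;   /* characters for which iswspace() is true */
    int columns;     /* sum of positive wcwidth() results */
};

/* Decode s one character at a time in the current LC_CTYPE locale.
 * Returns 0 on success, -1 on an invalid or truncated sequence. */
int scan_text(const char *s, struct text_stats *out)
{
    mbstate_t state;
    size_t left = strlen(s);

    memset(&state, 0, sizeof state);
    memset(out, 0, sizeof *out);

    while (left > 0) {
        wchar_t wc;
        /* n is the number of bytes that make up this character. */
        size_t n = mbrtowc(&wc, s, left, &state);
        int width;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;
        if (n == 0)
            break;              /* decoded a NUL; nothing more to scan */

        /* wcwidth() gives 0 for combining marks and -1 for
         * non-printable characters; only positive widths are added. */
        width = wcwidth(wc);
        if (width > 0)
            out->columns += width;
        if (iswspace((wint_t)wc))
            out->spaces++;

        out->chars++;
        s += n;
        left -= n;
    }
    return 0;
}
```

After something like setlocale(LC_CTYPE, "") in a UTF-8 environment, I
would expect scan_text("\xC3\xA0 la", &st) to report four characters, one
space and four columns, with the C3 A0 pair handled as a single letter
rather than as two bytes.  The exact widths come from the system's locale
data, so treat the numbers as typical rather than guaranteed.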

--Ken

