tech-userlevel archive

Re: Unicode programming



>>   I'm also wondering what people do about things
>>   like finding out how many columns a particular series of Unicode codepoints
>>   occupies
>
>This is very much nontrivial. There are a certain number of codepoints
>which have an ambiguous number of columns. You might also run into
>situations where the renderer might not be able to display combining
>diacritics in the expected way.

Is this true even for codepoints inside the BMP?  I know combining
characters are easily messed up ... but remember that I'm writing a
command-line program; the rendering is going to be done by someone
else, and I have no idea who that would be.  If I get the number of
columns wrong, that's okay in this application (mildly annoying,
but I get the sense that getting it "right" is nearly impossible).
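
For concreteness, here is roughly what the standard wcwidth()/wcswidth()
route looks like: a minimal sketch, assuming the program runs under a
UTF-8 locale.  The helper name columns_of() and the fixed buffer size are
just illustrative, and as noted above, the terminal may still disagree
about ambiguous-width or combining characters.

/*
 * Count display columns of a UTF-8 string via mbstowcs() + wcswidth().
 * Assumes LC_CTYPE selects a UTF-8 locale; returns -1 on an invalid
 * byte sequence or a nonprintable character.
 */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

static int
columns_of(const char *utf8)
{
    wchar_t wbuf[256];                     /* illustrative fixed size */
    size_t n;

    n = mbstowcs(wbuf, utf8, sizeof(wbuf) / sizeof(wbuf[0]));
    if (n == (size_t)-1)
        return -1;                         /* invalid multibyte sequence */
    return wcswidth(wbuf, n);              /* -1 if any char is nonprintable */
}

int
main(void)
{
    setlocale(LC_CTYPE, "");               /* pick up the user's locale */
    /* "abc" followed by U+00E9: 4 columns under a UTF-8 locale */
    printf("columns: %d\n", columns_of("abc\xc3\xa9"));
    return 0;
}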

>That depends a lot on what kind of data you're expecting. If it's going
>to "mostly" be Western text, UTF8 is going to be the most space-efficient,
>and won't lull you into any false security about codepoints always
>being able to fit into n bytes, etc. UTF16 is usually around 50% more
>efficient for non-Roman glyphs (and thus if you do lots of e.g. East Asian
>processing, might be worthwhile). You still run into combining characters
>and surrogates and whatnot, though.

As I understand it, surrogates are purely a UTF-16 thing, right?  I know
that means I can't assume 2 bytes == 1 codepoint, but I think it would
be easy to special-case.  But if I need a special case anyway, then maybe
sticking with UTF-8 would be easier ... decisions, decisions ...
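
Just to pin down what that special case looks like, a minimal sketch of
decoding one codepoint from a sequence of UTF-16 code units (surrogates
are indeed a UTF-16-only construct); the function name utf16_decode() is
illustrative:

/*
 * Decode one codepoint from a sequence of 'len' UTF-16 code units.
 * High surrogates are U+D800..U+DBFF, low surrogates U+DC00..U+DFFF;
 * a valid pair maps to U+10000..U+10FFFF.  Everything else is one
 * 16-bit unit per codepoint.  Returns the number of units consumed
 * (1 or 2), or 0 on a truncated or unpaired surrogate.
 */
#include <stddef.h>
#include <stdint.h>

static size_t
utf16_decode(const uint16_t *u, size_t len, uint32_t *cp)
{
    if (len == 0)
        return 0;
    if (u[0] >= 0xD800 && u[0] <= 0xDBFF) {        /* high surrogate */
        if (len < 2 || u[1] < 0xDC00 || u[1] > 0xDFFF)
            return 0;                              /* unpaired/truncated */
        *cp = 0x10000 + (((uint32_t)(u[0] - 0xD800) << 10) |
            (uint32_t)(u[1] - 0xDC00));
        return 2;
    }
    if (u[0] >= 0xDC00 && u[0] <= 0xDFFF)          /* stray low surrogate */
        return 0;
    *cp = u[0];
    return 1;
}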

Thanks for your input!

--Ken

