Re: Unicode programming
>> I'm also wondering what people do about things
>> like finding out how many columns a particular series of Unicode codepoints
>This is very much nontrivial. There are a certain number of codepoints
>which have an ambiguous number of columns. You might also run into
>situations where the renderer might not be able to display combining
>diacritics in the expected way.
Is this true for stuff inside of the BMP? I know combining
characters are easily messed up ... but remember that I'm writing a
command-line program; the rendering is going to be done by someone
else and I have no idea who that would be. If I get the number of
columns wrong, that's okay in this application (mildly annoying,
but I get the sense that getting it "right" is nearly impossible).
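
To make "good enough" concrete, here is a rough sketch of a column estimate, assuming Python and its standard unicodedata module. The function name estimate_columns is made up for illustration, and counting ambiguous-width characters as one column is an arbitrary choice; a real terminal may disagree.

```python
import unicodedata

def estimate_columns(text: str) -> int:
    """Rough terminal-column estimate; the renderer has the final say."""
    columns = 0
    for ch in text:
        # Combining marks normally stack on the preceding character.
        if unicodedata.combining(ch):
            continue
        # Control, format, and other non-spacing/enclosing marks take no column.
        if unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters usually occupy two columns.
            columns += 2
        else:
            # Includes "A" (ambiguous), which covers BMP characters such as
            # Greek letters; counted as one column here by assumption.
            columns += 1
    return columns
```

With this sketch, "e" followed by U+0301 (combining acute) counts as one column and a CJK ideograph counts as two.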
>That depends a lot on what kind of data you're expecting. If it's going
>to "mostly" be Western text, UTF8 is going to be the most space-efficient,
>and won't lull you into any false security about codepoints always
>being able to fit into n bytes, etc. UTF16 is usually around 50% more
>efficient for non-Roman glyphs (and thus if you do lots of e.g. East Asian
>processing, might be worthwhile). You still run into combining characters
>and surrogates and whatnot, though.
As I understand it, surrogates are purely a UTF-16 thing, right? I know
that means that I can't assume 2 bytes == 1 codepoint, but I think it
would be easy to special-case that. But if I have a special case then
maybe sticking with UTF-8 would be easier ... decisions, decisions ...
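
For reference, a small sketch of what that special case might look like, assuming Python and UTF-16-LE input; the helper names are hypothetical. Surrogate code units (0xD800 to 0xDFFF) only show up in UTF-16; UTF-8 encodes the same code points directly in up to four bytes.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def count_codepoints_utf16(data: bytes) -> int:
    """Count code points in UTF-16-LE bytes, treating a surrogate pair as one."""
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data) - 1, 2)]
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count
```

An ASCII letter is 1 byte in UTF-8 and 2 in UTF-16, U+65E5 is 3 bytes versus 2, and U+1F600 (outside the BMP) is 4 bytes in both, using a surrogate pair in UTF-16.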
Thanks for your input!