tech-userlevel archive


Re: Unicode programming

>   I'm also wondering what people do about things
>   like finding out how many columns a particular series of Unicode codepoints
>   occupies

This is very much nontrivial. Some codepoints have an ambiguous column
width (Unicode's "East Asian Ambiguous" class, for instance, is one column
wide in some contexts and two in others), and combining diacritics take
up no columns of their own. You might also run into situations where the
renderer can't display combining diacritics in the expected way.
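
To give a rough idea of what column counting involves, here is a minimal
Python sketch built on the standard unicodedata module. The display_width
helper and its ambiguous_wide flag are hypothetical names made up for this
example, and the rules are a simplification: a real terminal may disagree,
and in C you would more likely reach for wcwidth(3).

```python
import unicodedata

def display_width(text, ambiguous_wide=False):
    """Estimate how many terminal columns `text` occupies.

    Hypothetical helper: a simplification, not what any particular
    terminal or library actually does.
    """
    width = 0
    for ch in text:
        # Nonspacing marks, enclosing marks and format characters
        # (e.g. combining accents, zero-width joiner) add no columns.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            # Wide and Fullwidth characters take two columns.
            width += 2
        elif eaw == "A" and ambiguous_wide:
            # Ambiguous width: the caller decides, since it depends on
            # the locale and the font in use.
            width += 2
        else:
            # Everything else is assumed to be one column; control
            # characters are not treated specially here.
            width += 1
    return width

print(display_width("abc"))                          # 3
print(display_width("e\u0301"))                      # 1 (e + combining acute)
print(display_width("日本語"))                        # 6
print(display_width("\u00b1"))                       # 1 (plus-minus, ambiguous)
print(display_width("\u00b1", ambiguous_wide=True))  # 2
```

The ambiguous_wide flag is there precisely because the answer for those
codepoints cannot be derived from the text alone.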

> - Internally to your programs, do you use UTF-8 as your representation?
>   UTF-16?  UTF-32?  I know, this depends on what you're doing; I'm just
>   trying to get a sense of what is common.

That depends a lot on what kind of data you're expecting. If it's going
to be "mostly" Western text, UTF-8 is going to be the most space-efficient,
and it won't lull you into a false sense of security about codepoints
always fitting into n bytes, etc. For East Asian text, UTF-16 usually
takes 2 bytes per character where UTF-8 takes 3, so UTF-8 comes out
around 50% larger (and thus if you do lots of e.g. East Asian processing,
UTF-16 might be worthwhile). You still run into combining characters
and surrogates and whatnot, though.
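
If you want to check the size trade-off against your own data, a quick
Python sketch like the following will do. The encoded_sizes helper is a
hypothetical name for this example; it uses the little-endian codecs so
that no byte order mark gets counted.

```python
def encoded_sizes(text):
    """Return the encoded length of `text` in bytes for each encoding.

    Hypothetical helper for illustration; the "-le" codecs are used so
    the counts do not include a byte order mark.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# ASCII: UTF-8 is the smallest.
print(encoded_sizes("hello"))     # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
# CJK: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
print(encoded_sizes("日本語"))     # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
# Outside the BMP: UTF-16 needs a surrogate pair, so 4 bytes everywhere.
print(encoded_sizes("\U0001F600"))  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
# One visible character, two codepoints (e + combining acute).
print(encoded_sizes("e\u0301"))   # {'utf-8': 3, 'utf-16': 4, 'utf-32': 8}
```

The last two cases are the point: none of the three encodings gives you
one fixed-size unit per character as the user sees it.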
