Re: Unicode programming

To: Ken Hornstein <kenh%pobox.com@localhost>
Subject: Re: Unicode programming
From: Tom Spindler <dogcow%babymeat.com@localhost>
Date: Wed, 5 Oct 2011 14:49:12 -0700

>   I'm also wondering what people do about things
>   like finding out how many columns a particular series of Unicode codepoints
>   occupies

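This is very much nontrivial. Some codepoints have an ambiguous column
width (the "East Asian Ambiguous" class, for example, which terminals
draw as either one or two columns depending on locale and font). You
might also run into situations where the renderer can't display combining
diacritics in the expected way, so the width you compute doesn't match
what ends up on screen.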
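
A rough sketch of the usual approach, in Python using only the standard
unicodedata module. The function name display_width and the
ambiguous_width parameter are made up for illustration; the width rules
are a simplification (control characters and emoji sequences are not
handled), so treat this as an estimate rather than what any particular
terminal will actually do.

```python
import unicodedata

def display_width(text, ambiguous_width=1):
    """Estimate how many terminal columns `text` occupies.

    `ambiguous_width` is a hypothetical knob: East Asian Ambiguous
    codepoints are counted as that many columns (1 or 2).
    """
    width = 0
    for ch in text:
        # Combining marks are drawn on top of the previous character.
        if unicodedata.combining(ch):
            continue
        # Non-spacing marks, enclosing marks and format characters
        # (e.g. zero-width joiner) take no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            # Wide and Fullwidth characters take two columns.
            width += 2
        elif eaw == "A":
            # Ambiguous: the caller has to decide.
            width += ambiguous_width
        else:
            width += 1
    return width
```

The caller has to pick a value for the ambiguous case, which is exactly
the part that can't be decided from the codepoints alone.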

> - Internally to your programs, do you use UTF-8 as your representation?
>   UTF-16?  UTF-32?  I know, this depends on what you're doing; I'm just
>   trying to get a sense of what is common.

That depends a lot on what kind of data you're expecting. If it's going
to "mostly" be Western text, UTF-8 is going to be the most space-efficient,
and won't lull you into any false sense of security about codepoints
always being able to fit into n bytes, etc. For East Asian text, UTF-16
is more compact: most CJK characters take 3 bytes in UTF-8 but 2 in
UTF-16, so UTF-8 comes out around 50% larger (and thus if you do lots of
e.g. East Asian processing, UTF-16 might be worthwhile). You still run
into combining characters and surrogates and whatnot, though.
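
To make the size tradeoff concrete, here is a small Python sketch that
encodes the same string three ways and counts bytes. The helper name
encoded_sizes is made up for illustration; the "-le" codec variants are
used so that no byte order mark is included in the counts.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The -le variants skip the BOM, so only the payload is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
ascii_sizes = encoded_sizes("hello")

# CJK text in the BMP: 3 bytes per character in UTF-8, 2 in UTF-16.
cjk_sizes = encoded_sizes("\u65e5\u672c\u8a9e")

# A codepoint outside the BMP needs a surrogate pair in UTF-16,
# so it takes 4 bytes in all three encodings.
astral_sizes = encoded_sizes("\U0001F600")
```

The last case is the reminder that UTF-16 is still a variable-width
encoding, so "one 16-bit unit per character" is not a safe assumption.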
