tech-userlevel archive

Re: Unicode programming

> Let's assume I have an application that I want to add Unicode support
> to; let's also assume that in this application, when I get a
> sequence of bytes, I already know what their encoding is

> - Assuming the above is correct ... what do programmers do in terms
>   of parsing things like UTF-8 into Unicode codepoints, since you
>   don't necessarily know that mbrtowc() will give you a Unicode
>   codepoint on some (looks like many) systems.

I'd do it myself.  UTF-8 is not hard to parse, and doing it that way
insulates against "surprises" in mbrtowc() and friends.

In theory, it's better to use the existing well-tested wheel rather
than reinventing it.  But given the current state of such things, I'm
far from convinced that the existing wheels are well-tested, never mind
the situation on systems where they don't exist at all.
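
A minimal sketch of what a hand-rolled decoder might look like, in C.
The function name utf8_decode and its return convention are made up
for illustration; they are not from any library.  It assumes the caller
already knows the bytes are supposed to be UTF-8, and it rejects
overlong forms, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from s[0..n-1].
 * On success, stores the code point in *cp and returns the number of
 * bytes consumed (1 to 4).  Returns 0 if n is 0 or if the sequence is
 * truncated or malformed; *cp is left untouched in that case.
 */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const unsigned long min_for_len[5] =
        { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx */
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx */
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx */
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;       /* stray continuation byte or invalid lead */
    }

    if (n < len)
        return 0;                         /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        c = (c << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (c < min_for_len[len])
        return 0;                         /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                         /* UTF-16 surrogate */
    if (c > 0x10FFFF)
        return 0;                         /* beyond Unicode range */

    *cp = c;
    return len;
}
```

A caller would typically loop, advancing by the returned length, and
decide for itself what to do on a 0 return (skip a byte, substitute
U+FFFD, or give up).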

>   I'm also wondering what people do about things like finding out how
>   many columns a particular series of Unicode codepoints occupies;

"columns"?  In terms of pixels, or character cells, or what?  In any
case, the answer will depend on what's displaying them; since you said
you're pushing display off to some other program, you can't really tell
even in principle.  Personally, I'd probably do what I do when I want
to line things up now: assume a character-cell font, meaning that each
character occupies one character cell.  But I'm hardly an expert.
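
Under that one-cell-per-character assumption, counting columns in a
UTF-8 string reduces to counting code points.  A rough sketch in C
follows; utf8_cells is a hypothetical name, and the code assumes the
input is already valid UTF-8.  It deliberately ignores combining
characters and double-width characters, which a real display may
render in zero or two cells.

```c
#include <stddef.h>

/*
 * Hypothetical helper: estimate the width of s[0..n-1] in character
 * cells, assuming valid UTF-8 and exactly one cell per code point.
 * Every code point starts with exactly one byte that is not a
 * continuation byte (10xxxxxx), so counting those bytes counts
 * code points.
 */
size_t utf8_cells(const unsigned char *s, size_t n)
{
    size_t cells = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            cells++;
    }
    return cells;
}
```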

>   I know about things like wcswidth(), but again you're not guaranteed
>   that wide characters are Unicode codepoints.

wcwidth() and wcswidth() are actually unimplementable in the general
case, because they depend on information not available to the
application even in theory: the program responsible for displaying the
text may not even have been chosen, much less started, at wcwidth()
time, and may run on a completely different machine.  For that matter,
when text is displayed in a variable-pitch font, "column positions"
don't really even exist.

> - Internally to your programs, do you use UTF-8 as your
>   representation?  UTF-16?  UTF-32?  I know, this depends on what
>   you're doing; I'm just trying to get a sense of what is common.

I don't really consider myself competent to comment on what's common.
In your situation, I'd probably just stuff codepoints in (a typedef
for) unsigned short - you said you're willing to write off anything
outside the BMP.  What code I've written that doesn't just treat
strings as opaque octet blobs (which may or may not contain UTF-8
encodings) generally assumes a one-byte encoding such as one of the
8859 sets.
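
As a rough illustration of the typedef approach, in C: the names
bmp_char and bmp_from_codepoint are invented for this sketch, and
substituting U+FFFD for anything outside the BMP is just one possible
policy, not the only reasonable one.  The C standard guarantees that
unsigned short holds at least 16 bits, which is enough for any BMP
code point.

```c
/* Hypothetical storage type: one BMP code point per element. */
typedef unsigned short bmp_char;

/* Replacement used for anything this representation cannot hold. */
#define BMP_REPLACEMENT 0xFFFDu

/*
 * Hypothetical helper: narrow a decoded code point to bmp_char.
 * Code points above U+FFFF and UTF-16 surrogates (U+D800..U+DFFF)
 * are mapped to U+FFFD instead of being silently truncated.
 */
bmp_char bmp_from_codepoint(unsigned long cp)
{
    if (cp > 0xFFFFul)
        return (bmp_char)BMP_REPLACEMENT;   /* outside the BMP */
    if (cp >= 0xD800ul && cp <= 0xDFFFul)
        return (bmp_char)BMP_REPLACEMENT;   /* surrogate, not a character */
    return (bmp_char)cp;
}
```

Hiding the type behind a typedef keeps the door open to widening it
later if characters outside the BMP turn out to matter after all.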

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML      
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
