Unicode programming

To: tech-userlevel%netbsd.org@localhost
Subject: Unicode programming
From: Ken Hornstein <kenh%pobox.com@localhost>
Date: Wed, 05 Oct 2011 15:51:52 -0400

Greetings all,

I want to preface this right up front by saying that these questions aren't
technically NetBSD-related (although they involve software that is portable
to NetBSD), but I know there are a bunch of people here who are much smarter
than I am and who understand all of this stuff, and since we've talked about
this before here, I hope this won't be out of line.

I've read up on Unicode, and of course I've read the stuff that was on this
mailing list last year (great help!).  I have some additional questions that
I was hoping someone would be willing to answer.

Let's assume I have an application that I want to add Unicode support
to; let's also assume that whenever this application gets a sequence of
bytes, I already know what the encoding of those bytes is (it won't
necessarily be UTF-8).  This is a command-line application,
so I'm going to punt the heavy lifting in terms of displaying Unicode
glyphs to something else like xterm.  I'm at the Plan 9 level of
Unicode support; by that I mean I mostly only care about stuff in the
Basic Multilingual Plane, and I'm not worried about text with different
orientations.

- I'm aware of the multibyte functions like mbrtowc(), and I know that
  how these functions interpret their input depends on the encoding set
  in your environment.  But what I don't quite see is what these
  functions are supposed to output in terms of "wide characters"; it seems
  like this is unspecified.  I gather that if the C language implementation
  defines the macro __STDC_ISO_10646__ then you know that "wide" characters
  are Unicode codepoints.  If that macro isn't defined ... then I guess
  what wide characters are is undefined?  Is that correct?

- Assuming the above is correct ... what do programmers do in terms of
  parsing things like UTF-8 into Unicode codepoints, given that you don't
  necessarily know that mbrtowc() will give you a Unicode codepoint on
  some (looks like many) systems?  I guess iconv() looks like something
  that handles a lot of encodings, and it seems to be available in lots of
  places; I'm also aware of ICU.  I'm also wondering what people do about things
  like finding out how many columns a particular series of Unicode codepoints
  occupies; I know about things like wcswidth(), but again you're not
  guaranteed that wide characters are Unicode codepoints.

- Internally to your programs, do you use UTF-8 as your representation?
  UTF-16?  UTF-32?  I know, this depends on what you're doing; I'm just
  trying to get a sense of what is common.
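
To make the first two questions concrete, here is a minimal sketch in C of
one possible approach: test __STDC_ISO_10646__ at compile time, and decode
UTF-8 by hand so the result does not depend on the locale or on what
wchar_t holds.  The names wchar_is_unicode(), utf8_decode() and utf8_count()
are hypothetical helpers made up for illustration, not part of any library,
and this is not a claim about how any particular application does it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if the implementation promises that
 * wchar_t values are ISO 10646 (Unicode) code points, 0 otherwise.
 */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/*
 * Hypothetical helper: decode one code point from the n bytes at s.
 * Stores it in *cp and returns the number of bytes consumed, or 0 if
 * the input is empty, truncated, or not well-formed UTF-8.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or bad lead byte */
    }
    if (n < len)
        return 0;                           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/*
 * Hypothetical helper: count the code points in the n bytes at s.
 * Returns (size_t)-1 if the buffer is not well-formed UTF-8.
 */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0, used;
    uint32_t cp;

    while (n > 0) {
        used = utf8_decode(s, n, &cp);
        if (used == 0)
            return (size_t)-1;
        s += used;
        n -= used;
        count++;
    }
    return count;
}
```

A decoder like this only covers UTF-8; input in other encodings would
still need something like iconv() to get to code points first.  The
column-width question is separate: wcwidth() and wcswidth() take wchar_t,
so passing them decoded code points is only safe where wchar_is_unicode()
returns 1.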

Thanks for any advice you can give me,

--Ken
