tech-userlevel archive


Re: wide characters and i18n



On Wed, 14 Jul 2010 19:38:42 -0700
Erik Fair <fair%netbsd.org@localhost> wrote:

> A whole lot of software has already been written to deal with this
> problem (but not necessarily completely or well), and you would do
> well to research what's available before attempting to reinvent your
> own rounder wheel - someone might have already solved your particular
> problem ... just not in the base NetBSD distribution.
> 
>       Erik <fair%netbsd.org@localhost>

Well yes, I did some research, and when it comes to the C language and
internationalization it's quite difficult to find documentation that
gives developers sound advice on how to handle different character
encodings.

The code I'm writing at the moment, for instance, parses program
configuration files, which are simple text files. However, assuming
that text file == ASCII file is a bit restrictive. For example:

        log: "/path/to/file_name";

The path and the file name strings can be encoded in many different
ways: UTF-8, UTF-16, UTF-32, JIS, KOI8 and so on. I don't think
Unix filesystems care what encoding it is; they simply treat a name as
a sequence of octets, as long as it contains no NUL (0x00) or '/'
octets. (In practice that rules out UTF-16 and UTF-32 for the names
themselves, since both put zero octets into ASCII text.)
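
To make the "sequence of octets" rule concrete, here is a minimal
sketch of the check. The function name octets_ok_for_name() is made up
for illustration, and a real filesystem will also impose length limits
that are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: true if the len octets at buf could serve as a
 * single path component, i.e. the buffer is non-empty and contains
 * neither a NUL octet nor a '/'.  The encoding is never inspected.
 */
bool
octets_ok_for_name(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    if (len == 0)
        return false;
    return memchr(buf, '\0', len) == NULL &&
        memchr(buf, '/', len) == NULL;
}
```

The UTF-8 octets C3 A9 pass this check, while the UTF-16LE octets
61 00 (the letter "a") fail it because of the zero octet.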

As a software developer you need to figure out the following two things:

1. Which encodings can your program accept, and how do you
determine/auto-detect them?

2. How do you represent this data internally in your program?

I think the answer to question 1 depends on the context. If it's your
local data, e.g. system configuration files, file names, etc., then the
Unix locale is the most reliable way to tell the encoding.
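
As a rough sketch of what asking the locale looks like in C, assuming
a POSIX system that provides nl_langinfo(): the wrapper name
locale_codeset() is invented here, and the exact string returned
differs between systems, so treat the result as a hint.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: name of the character encoding selected by the
 * user's environment (LC_ALL, LC_CTYPE, LANG).  The spelling of the
 * name is system dependent.
 */
const char *
locale_codeset(void)
{
    const char *cs;

    /* Adopt the environment's locale; fall back to "C" if it is bogus. */
    if (setlocale(LC_CTYPE, "") == NULL)
        (void)setlocale(LC_CTYPE, "C");

    cs = nl_langinfo(CODESET);
    assert(cs != NULL);
    return cs;
}
```

The returned string may be overwritten by later setlocale() or
nl_langinfo() calls, so copy it if you need to keep it.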

If it's data you get over the network, e.g. email, web pages, etc.,
then the encoding is usually specified explicitly, either in protocol
headers (such as the charset parameter of Content-Type) or in the file
itself.
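
For the protocol header case, something along these lines would pull
the charset out of a Content-Type style value. This is only a sketch:
header_charset() is a made-up name, and the matching is deliberately
naive (it looks for the text "charset=" anywhere in the value and does
not implement the full MIME parameter grammar).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper: copy the value of the charset parameter found
 * in a header value such as "text/plain; charset=koi8-r" into buf.
 * Returns 0 on success, -1 if no charset is present or buf is too
 * small.
 */
int
header_charset(const char *value, char *buf, size_t buflen)
{
    static const char key[] = "charset=";
    const size_t keylen = sizeof(key) - 1;
    const char *p;
    size_t n = 0;

    assert(value != NULL && buf != NULL);

    /* Parameter names are case-insensitive, so scan by hand. */
    for (p = value; *p != '\0'; p++)
        if (strncasecmp(p, key, keylen) == 0)
            break;
    if (*p == '\0')
        return -1;
    p += keylen;

    /* The value may be quoted; skip an opening quote if present. */
    if (*p == '"')
        p++;

    /* Copy up to the closing quote, ';', whitespace or end of string. */
    while (*p != '\0' && *p != '"' && *p != ';' &&
        !isspace((unsigned char)*p)) {
        if (n + 1 >= buflen)
            return -1;
        buf[n++] = *p++;
    }
    if (n == 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

The resulting name can then be handed to whatever conversion routine
you use, provided it knows that encoding.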

The answer to question 2 is a bit more complex. Some environments use
utf-8 or utf-16, but since these are variable length encodings, you
can't treat a string as a simple array of characters: moving a pointer
N characters forward or backward means decoding the octets in between
instead of just adding or subtracting N.
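
To illustrate the cost, this is roughly what "advance by N characters"
turns into for UTF-8. It is a sketch that assumes the input is already
valid UTF-8 and counts code points, not what a user would perceive as
characters; utf8_advance() is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a pointer n code points past s, or a
 * pointer to the terminating NUL if the string is shorter than that.
 * Assumes s is valid UTF-8; no validation is done.
 */
const char *
utf8_advance(const char *s, size_t n)
{
    assert(s != NULL);

    while (n > 0 && *s != '\0') {
        /* Step over the lead octet ... */
        s++;
        /* ... and any continuation octets (bit pattern 10xxxxxx). */
        while (((unsigned char)*s & 0xC0) == 0x80)
            s++;
        n--;
    }
    return s;
}
```

With the string "a", e-acute, "z" (four octets), advancing by two
characters moves the pointer three octets, so the work is linear in
the distance travelled rather than a single addition.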

I settled on wchar_t and the C library wide character functions. I
think I can use them with minimal fuss, and there is always iconv() or
similar if I need to convert from/to some weird encoding.
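
For the record, the conversion into wchar_t could look something like
the sketch below. It assumes the input is in the encoding of the
current LC_CTYPE locale (so the program would normally call
setlocale(LC_CTYPE, "") at startup), and mbs_to_wcs() is just a name
picked for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a multibyte string in the current
 * locale's encoding to a freshly allocated wide string.  Returns NULL
 * on an invalid sequence or allocation failure; the caller frees the
 * result.
 */
wchar_t *
mbs_to_wcs(const char *mbs)
{
    mbstate_t st;
    const char *src;
    wchar_t *wcs;
    size_t n;

    assert(mbs != NULL);

    /* First pass: count wide characters without storing them. */
    memset(&st, 0, sizeof(st));
    src = mbs;
    n = mbsrtowcs(NULL, &src, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    wcs = malloc((n + 1) * sizeof(*wcs));
    if (wcs == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null. */
    memset(&st, 0, sizeof(st));
    src = mbs;
    if (mbsrtowcs(wcs, &src, n + 1, &st) == (size_t)-1) {
        free(wcs);
        return NULL;
    }
    return wcs;
}
```

Once the text is in wchar_t form, wcs[i] is the i-th character and
plain pointer arithmetic works again, at the price of more memory per
character.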

