tech-userlevel archive


Re: wide characters and i18n



On Sat, 10 Jul 2010 10:32:39 +0100
Sad Clouds <cryintothebluesky%googlemail.com@localhost> wrote:

> Hi, I'm trying to understand how to write portable C code that supports
> international character sets. As I understand so far, it has a lot to
> do with C library and current locale setting.
> 
> 1. What is the recommended way for user level applications to deal with
> different character encodings? Treat all external character data as
> multi-byte and use C library wchar_t and wide stdio functions to
> convert multi-byte to wchar_t for internal character processing?

Others may have better suggestions; my experience here is limited to
UTF-8 and UTF-32/UCS-4, using custom code rather than the C99 wchar
functions.  I can, however, share some of the "issues" I encountered.

Used that way, the input is basically treated as UTF-8 and decoded to
an internal 32-bit (host-endian) representation.  Problems can arise
when the input isn't valid UTF-8, and here implementations (and
applications) differ.  Some treat invalid sequences as ISO-8859-15 and
convert them to the equivalent UTF-32 code points.  Others simply
refuse to parse the string (if I understand correctly, the C99
functions stop at an invalid sequence with a restartable error,
allowing the application to decide what to do).

When dealing with invalid UTF-8 or UTF-16 input sequences, one
possible solution is to store the invalid bytes or words as-is by
mapping them into a range of special/invalid Unicode code points.
This preserves the original input non-destructively.  Some software
instead replaces any invalid sequence with the Unicode replacement
character (U+FFFD), but that is generally considered bad practice,
being destructive.

For output, the 32-bit representation is encoded back to UTF-8 (or
whatever the external encoding is).  If a special range of code points
was used to preserve invalid sequences, those bytes/words are restored
exactly as they were.

I haven't looked much at the wchar_t support implementation, but
wchar_t appears to map to a 32-bit int, so it shouldn't be much
different from what I described.

> 2. How extensive is NetBSD's support for i18n and wide characters? Are
> there any missing bits or things I need to look out for?

I'm not sure how well normalization is handled, but in UCS-4 the
rules are obviously different for converting between lowercase and
uppercase, comparing for sorting, and converting accented characters
to their unaccented equivalents (useful for matching strings against
user-supplied keywords, for instance).  So it's important to use the
proper functions for those operations: wcsncmp(3) instead of
strncmp(3), towlower(3) instead of tolower(3), and so on.

> 3. Any functions that are not thread-safe that I need to look out for?

The strerror_r(3) function should be used instead of strerror(3) in
threaded code: with the advent of NLS and locales, strerror(3) can no
longer simply return a pointer to a constant string in a static array,
so its result may live in a buffer shared between threads.

As for the wchar_t input/output functions, I'm unsure whether their
conversion state is thread-safe or requires explicit locking for
concurrent use.

Handling locales correctly is more complex too: a locale might use a
different decimal separator and date format, and its typographical
conventions might favour a particular quoting style.  Some functions
support locale-specific output, such as strftime(3); I'm not sure
whether printf(3) is supposed to honour the locale's decimal separator
automatically.  nl_langinfo(3) can be used by libraries to conform to
the locale in use (nls(7) has more information).  I personally have no
experience with it here.
-- 
Matt

