Re: Unicode programming

To: tech-userlevel%netbsd.org@localhost
Subject: Re: Unicode programming
From: Matthew Mondor <mm_lists%pulsar-zone.net@localhost>
Date: Wed, 5 Oct 2011 18:54:06 -0400

On Wed, 05 Oct 2011 15:51:52 -0400
Ken Hornstein <kenh%pobox.com@localhost> wrote:

> - Internally to your programs, do you use UTF-8 as your representation?
>   UTF-16?  UTF-32?  I know, this depends on what you're doing; I'm just
>   trying to get a sense of what is common.

So far I've used UTF-32/UCS-4 for the internal representation and UTF-8
only as the external representation in my own software.  A complication
arises if invalid UTF-8 input sequences are possible (IRC or arbitrary
user-provided data, for example), in which case there are several
possible solutions:

- Raise an error and reject the input (strict, secure, but annoying)
- Treat the invalid input octets as ISO-8859-*.  This can be
  problematic if you must output the original sequence as-is, and there
  is also no guarantee that the octets were intended as LATIN-1
  characters
- Import the invalid input octets as special codepoints within a
  UTF-16 surrogate range (the so-called UTF-8B encoding): invalid
  octets get mapped to 0xDCxx on input/decoding, and converted back
  as-is to 8-bit octets on output/encoding.  Of course, the application
  must treat that character range appropriately too.
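
As a rough sketch of the three options above, here is how they might
look in Python (chosen only for illustration).  The function names and
the "latin1-fallback" handler name are made up for this example.  The
"surrogateescape" error handler is built into Python and follows the
same idea as UTF-8B: each undecodable octet 0xNN becomes U+DCNN.

```python
import codecs


def decode_strict(data: bytes) -> str:
    """Option 1: reject invalid input (raises UnicodeDecodeError)."""
    return data.decode("utf-8", errors="strict")


def _latin1_replace(err):
    # Hypothetical handler: reinterpret only the offending octets as
    # ISO-8859-1 and resume decoding right after them.
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return bad.decode("latin-1"), err.end
    raise err


codecs.register_error("latin1-fallback", _latin1_replace)


def decode_latin1_fallback(data: bytes) -> str:
    """Option 2: lossy guess, the original octets cannot be recovered."""
    return data.decode("utf-8", errors="latin1-fallback")


def decode_utf8b(data: bytes) -> str:
    """Option 3: map invalid octets into the U+DC80..U+DCFF range."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Inverse of decode_utf8b: escaped codepoints become raw octets."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"Fran\xe7ais"  # LATIN-1 octets, not valid UTF-8
escaped = decode_utf8b(raw)
roundtrip = encode_utf8b(escaped)
```

Only the third option lets the original octets be written back
unchanged, which matters when the data has to be relayed as-is.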

I don't have personal experience with other Unicode encodings.  However,
I highly recommend explicitly marking objects with an encoding tag if
you intend to support multiple encodings, because guessing an encoding
is non-trivial, and conversions between them may be lossy.  I think
the ideal is when a protocol explicitly specifies the expected
encoding, which makes things simpler (as in my case, with strictly
UTF-8 external representations).
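
To illustrate the tagging idea, here is a minimal Python sketch of an
object that carries its encoding alongside its octets.  The class name
TaggedBytes and its methods are hypothetical, not taken from any
existing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedBytes:
    """Octets plus the encoding they are known to be in."""
    data: bytes
    encoding: str

    def text(self) -> str:
        # No guessing: decode with the encoding recorded in the tag.
        return self.data.decode(self.encoding)

    def transcode(self, target: str) -> "TaggedBytes":
        # Raises UnicodeEncodeError instead of silently losing
        # characters that the target encoding cannot represent.
        return TaggedBytes(self.text().encode(target), target)


latin = TaggedBytes(b"Fran\xe7ais", "latin-1")
utf8 = latin.transcode("utf-8")
```

Going the other way can fail: a UTF-8 string containing a character
outside LATIN-1 cannot be transcoded to "latin-1", and the strict
encoder reports that rather than dropping it.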

A problematic example is filenames in file systems which allow
arbitrary bytes (like FFS).  I tend to encounter both LATIN-1 and UTF-8
filenames in French, but filenames are not tagged with an encoding.
When you control the file creation and the remote protocol lets you
know the encoding, I guess that you could tag filenames either using a
MIME message header-like format (e.g. =?UTF-8?B?<...>?=) or using an
extended attribute or custom metadata format, but there is no definite
standard for tagging Unicode filenames with their encoding.  Some file
systems expect valid UTF-16 or UTF-8 strings, though.
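
A minimal Python sketch of the MIME-like tagging idea follows.  The
function names are hypothetical.  One deliberate deviation from the
MIME encoded-word format: the standard base64 alphabet contains "/",
which cannot appear in a filename, so this sketch assumes the URL-safe
alphabet ("-" and "_") instead.

```python
import base64
import re

# =?<encoding>?B?<url-safe base64 payload>?=
_TAGGED = re.compile(r"^=\?([A-Za-z0-9._-]+)\?B\?([A-Za-z0-9_=-]*)\?=$")


def tag_filename(name: str, encoding: str = "UTF-8") -> str:
    """Build a self-describing filename that records its encoding."""
    payload = base64.urlsafe_b64encode(name.encode(encoding))
    return "=?%s?B?%s?=" % (encoding, payload.decode("ascii"))


def untag_filename(stored: str):
    """Return the decoded name, or None if the name is not tagged."""
    m = _TAGGED.match(stored)
    if m is None:
        return None
    encoding, payload = m.groups()
    return base64.urlsafe_b64decode(payload).decode(encoding)


tagged = tag_filename("Fran\u00e7ais.txt", "ISO-8859-1")
restored = untag_filename(tagged)
```

The obvious cost is that the stored names are no longer readable with
ls(1), which is one reason an extended attribute may be preferable
where the file system supports it.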

An area which I found slightly challenging was allowing the user to
search within Unicode data.  Both the user's input keywords and the data
index had to be passed through a normalizing function so that
searching for, say, Francais or Français could match both Francais and
Français.  This had to be extended to ligatures.  With PostgreSQL, I
had to use a contrib module named unaccent to be able to apply the same
normalization conversions to the data being matched as the ones
performed in the application.  And that only considers European
languages; it must be a greater challenge to support Asian languages...
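
For illustration, an application-side normalizing function of this kind
could be sketched in Python as below.  This is not the function used in
the software described above; the name search_key and the small
ligature table are assumptions.  The table is needed because NFKD
decomposes compatibility ligatures such as U+FB01 but leaves letters
like U+0153 (the French oe ligature) untouched.

```python
import unicodedata

# Letters that NFKD does not decompose; extend as needed.
_EXTRA_FOLDS = str.maketrans({
    "\u0153": "oe", "\u0152": "OE",  # oe ligature
    "\u00e6": "ae", "\u00c6": "AE",  # ae ligature
})


def search_key(text: str) -> str:
    """Fold accents, ligatures and case into a comparison key."""
    text = text.translate(_EXTRA_FOLDS)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents) left over after decomposition.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold()


same = search_key("Fran\u00e7ais") == search_key("Francais")
```

Both the query and the indexed data must go through the same function,
otherwise the keys will not compare equal.
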
-- 
Matt
