Subject: Re: behaviour of iconv in NetBSD and pkgsrc libiconv
To: Bruno Haible <bruno@clisp.org>
From: None <joerg@britannica.bec.de>
List: tech-userlevel
Date: 04/04/2006 15:28:33
On Mon, Apr 03, 2006 at 03:26:45PM +0200, Bruno Haible wrote:
> If you want to distinguish between invalid input and valid but unconvertible
> input, perform a conversion to "UTF-8".
> 
> > The correct behaviour to find the first unconvertible character, as sad
> > as it might seem, is to perform a binary search.
> 
> A binary search will need to work from the beginning of the text over and
> over again. (Because the iconv_t contains state information (e.g. when you
> convert from ISO-2022-JP-2), and since you have no way to clone a conversion
> descriptor or to go "backwards".)

So this makes finding the breaking character somewhere around O(n * lg n)
for the *error* case. Quite frankly, I don't care about that overhead
since it is only needed for diagnostics anyway.
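
A minimal sketch of that binary search, written in Python rather than C
for brevity. It is only a model: lossy_count() is a hypothetical stand-in
for an iconv() that substitutes unconvertible characters and reports
nothing but a count, and it opens a "fresh" conversion on every probe, so
each probe reconverts the prefix from the start. That gives about lg n
probes of up to n characters each.

```python
import codecs


def lossy_count(text, encoding):
    """Hypothetical stand-in for a substituting converter.

    Converts `text` from the beginning, replaces every unconvertible
    character with '?', and returns how many characters were replaced.
    """
    hits = 0

    def handler(exc):
        nonlocal hits
        hits += exc.end - exc.start      # characters covered by this error
        return ("?", exc.end)            # substitute and continue after them

    codecs.register_error("count-and-replace", handler)
    text.encode(encoding, errors="count-and-replace")
    return hits


def first_unconvertible(text, encoding):
    """Index of the first unconvertible character, or None if there is none."""
    if lossy_count(text, encoding) == 0:
        return None
    # Invariant: text[:lo] converts cleanly, text[:hi] does not.
    lo, hi = 0, len(text)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if lossy_count(text[:mid], encoding) == 0:
            lo = mid
        else:
            hi = mid
    return hi - 1
```

Every probe restarts at offset 0 on purpose, since the conversion state
cannot be cloned or rewound.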

If this should be optimised, the wchar interface is a better option,
e.g. wcsrtombs.
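
For comparison, a sketch of the single-pass case, where the converter
itself stops at the first character it cannot represent and says where
that was. This is only an analogy in Python, not the C wchar API: strict
str.encode() plays the role of a function such as wcsrtombs(), and the
name first_unencodable() is made up for illustration.

```python
def first_unencodable(text, encoding):
    """Index of the first character `encoding` cannot represent, else None."""
    try:
        text.encode(encoding)            # strict mode: stop at the first failure
    except UnicodeEncodeError as exc:
        return exc.start                 # position reported by the converter
    return None
```

No restart is needed here, so the error case costs one pass over the input.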

The problem I have with libiconv's behaviour is that it suggests that
incrementing the input buffer and restarting *is* allowed. In fact, that
seems to be exactly what libxml is doing.

> With the GNU implementations of iconv(), you have the choice between two
> variants: by default you get the useful behaviour. If you enable
> transliteration (by appending a "//TRANSLIT" to the to_code passed to
> iconv_open), less characters are treated as unconvertible.

Transliteration is something different. It is the approximate
representation of characters in the context of a specific language or
character set environment. Typical examples are ligatures and currency
symbols. You still lose information, both physically and semantically,
but at least the former is approximated.
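
To make the difference concrete, here is a rough sketch in Python. It is
not what GNU //TRANSLIT does internally; the CURRENCY table and the
function name are invented for the example, and Unicode compatibility
decomposition (NFKD) is used as a cheap way to approximate ligatures and
accented letters.

```python
import unicodedata

# Hypothetical, deliberately tiny transliteration table.
CURRENCY = {"\u20ac": "EUR", "\u00a3": "GBP", "\u00a5": "JPY"}


def transliterate_to_ascii(text):
    """Approximate `text` in ASCII, falling back to '?' per character."""
    out = []
    for ch in text:
        if ch in CURRENCY:
            out.append(CURRENCY[ch])
            continue
        # NFKD splits e.g. the "fi" ligature into 'f' + 'i' and an
        # accented letter into base letter + combining mark.
        decomposed = unicodedata.normalize("NFKD", ch)
        approx = decomposed.encode("ascii", "ignore").decode("ascii")
        out.append(approx if approx else "?")
    return "".join(out)


def replace_to_ascii(text):
    """Plain substitution: every unconvertible character becomes '?'."""
    return text.encode("ascii", "replace").decode("ascii")
```

With the input "\ufb01ve \u20ac" (the fi ligature, "ve", a space and a euro
sign), the first function yields "five EUR" and the second "?ve ?". For
text with no usable approximation, such as Chinese, both end up with
question marks.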

> But the
> GNU implementations don't just blindly replace unconvertible characters
> with question marks, because
>    1. this is not what programs need,

Please be careful when saying "this is not what programs need". In many
situations, it is exactly what is wanted. iconv attempts a best-effort
representation, and most programs don't care about the error handling at
all. Compare running libxml by hand (feedback possible) with using it
automatically as a filter (feedback not possible, and both errors and
diagnostics are often ignored).

>    2. when, say, converting Chinese to ASCII, an output like
>       "???????? ???? ?????" doesn't help.

Well, what else should it generate? It is not an error to convert valid
Chinese to ASCII and without transliteration, you can't do anything
else.

Joerg