tech-userlevel: Re: behaviour of iconv in NetBSD and pkgsrc libiconv

Subject: Re: behaviour of iconv in NetBSD and pkgsrc libiconv
To: James K. Lowden <jklowden@schemamania.org>
From: Bruno Haible <bruno@clisp.org>
List: tech-userlevel
Date: 04/03/2006 15:26:45

Hi,

Joerg wrote:

> > This is even mentioned in our man page iconv(3):
> >
> >   "If no conversion exists for a particular character, an
> >    implementation-defined conversion is performed on this character."

This is how POSIX:2001 specifies it.

> > NetBSD's iconv() completes the conversion of the whole buffer and maps
> > such characters to a question mark. The return value of iconv() shows
> > how many of those non-reversible conversions happened.

This too is POSIX compliant.

> > In contrast, converters/libiconv stops the conversion at this point,
> > returns an error and gives the application a chance to do something
> > about the unconvertible character [1].
>
> The GNU implementation is clearly broken.

The GNU implementations of iconv() - both the one in glibc and libiconv -
stop when an unconvertible character is encountered. This is not POSIX
compliant, but I would qualify it as "useful", not "broken".

> An implementation-defined
> conversion is *not* an error. EILSEQ is an absolutely inappropriate error
> message, since it doesn't allow one to distinguish between invalid input
> and valid but unconvertible input.

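If you want to distinguish between invalid input and valid but unconvertible
input, perform a conversion to "UTF-8" first. UTF-8 can represent every
character, so an EILSEQ from that conversion can only mean invalid input.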
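
A minimal sketch of that two-step check in C. The names probe(),
classify_input() and enum input_status are made up for illustration; the
sketch assumes an iconv() that reports unconvertible characters with EILSEQ,
as the GNU implementations do. With an implementation that substitutes a
replacement character instead, the second step simply succeeds.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical result type, not part of any iconv API. */
enum input_status { INPUT_OK, INPUT_INVALID, INPUT_UNCONVERTIBLE, INPUT_ERROR };

/* Push the whole buffer through one conversion and discard the output.
   Returns 0 on success, the errno of the failing iconv() call, or -1 if
   the conversion descriptor could not be opened. */
static int probe(const char *to, const char *from, const char *src, size_t len)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src;   /* iconv() wants char **, the data is not modified */
    size_t inleft = len;
    int err = 0;
    while (inleft > 0) {
        char scratch[256];
        char *out = scratch;
        size_t outleft = sizeof scratch;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1
            && errno != E2BIG) {   /* E2BIG: scratch is full, go around again */
            err = errno;
            break;
        }
    }
    iconv_close(cd);
    return err;
}

/* Step 1: convert to UTF-8; a failure here means the input itself is bad.
   Step 2: convert to the real target; EILSEQ now means "unconvertible". */
enum input_status classify_input(const char *to, const char *from,
                                 const char *src, size_t len)
{
    int e = probe("UTF-8", from, src, len);
    if (e == EILSEQ || e == EINVAL)   /* invalid or truncated sequence */
        return INPUT_INVALID;
    if (e != 0)
        return INPUT_ERROR;

    e = probe(to, from, src, len);
    if (e == 0)
        return INPUT_OK;
    return e == EILSEQ ? INPUT_UNCONVERTIBLE : INPUT_ERROR;
}
```

For example, classify_input("ASCII", "ISO-8859-1", "caf\xe9", 4) would be
expected to yield INPUT_UNCONVERTIBLE with the GNU implementations, since the
input is valid ISO-8859-1 but 0xE9 has no ASCII equivalent.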

> The correct behaviour to find the first unconvertible character, as sad
> as it might seem, is to perform a binary search.

A binary search has to restart from the beginning of the text for every
probe, because the iconv_t contains state information (e.g. when you
convert from ISO-2022-JP-2) and you have no way to clone a conversion
descriptor or to go "backwards".

Alternatively you can feed bytes one by one into the conversion descriptor.
But this is slow as well. Either way, you see that the POSIX spec is
inadequate.
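
The byte-by-byte approach could be sketched as follows. The function name
first_unconvertible() is hypothetical. The sketch assumes that iconv()
reports an incomplete multibyte sequence with EINVAL and leaves the input
pointer at its start, and it accepts either way of signalling an
unconvertible character: a positive return value (POSIX style) or EILSEQ
(GNU style).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Returns the byte offset of the first character that cannot be converted
   exactly, or -1 if there is none (or if something else went wrong). */
long first_unconvertible(const char *to, const char *from,
                         const char *src, size_t len)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    long found = -1;
    size_t pos = 0;   /* offset of the first byte not yet consumed */
    size_t n = 1;     /* number of bytes offered in the next call */

    while (pos + n <= len) {
        char *in = (char *)src + pos;
        size_t inleft = n;
        char scratch[64];
        char *out = scratch;
        size_t outleft = sizeof scratch;

        size_t r = iconv(cd, &in, &inleft, &out, &outleft);
        size_t consumed = n - inleft;

        if (r == (size_t)-1) {
            if (errno == EINVAL) {
                /* Incomplete multibyte sequence: retry with one more byte. */
                pos += consumed;
                n = inleft + 1;
                continue;
            }
            if (errno == EILSEQ)
                found = (long)(pos + consumed);  /* GNU style: stopped here */
            break;                               /* anything else: give up */
        }
        if (r > 0) {
            found = (long)pos;                   /* POSIX style: substituted */
            break;
        }
        if (consumed == 0)
            break;                               /* no progress, avoid looping */
        pos += consumed;
        n = 1;
    }
    iconv_close(cd);
    return found;
}
```

Note that with the GNU implementations the EILSEQ branch cannot tell an
invalid sequence from an unconvertible character, and that every character
costs at least one iconv() call, which is exactly the slowness complained
about above.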

James K. Lowden wrote:
> From the FreeTDS
> perspective, it's important to distinguish between bad input and a weak
> output encoding.

Then use iconv_open("UTF-8", from_code).

With the GNU implementations of iconv(), you have the choice between two
variants: by default you get the useful behaviour, i.e. the conversion stops
at the unconvertible character. If you enable transliteration (by appending
"//TRANSLIT" to the to_code passed to iconv_open), fewer characters are
treated as unconvertible. But the GNU implementations don't just blindly
replace unconvertible characters with question marks, because
   1. this is not what programs need,
   2. when, say, converting Chinese to ASCII, an output like
      "???????? ???? ?????" doesn't help.

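To illustrate the two variants, here is a small sketch in C. The helper
name convert_buffer() is made up for this example; the "//TRANSLIT" suffix
is the one described above. What a transliterated character turns into (for
instance "e" or "?" for an e-acute) depends on the implementation and, with
glibc, on the current locale, so the sketch makes no assumption about it.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert len bytes of src from encoding `from` to encoding `to`, writing
   at most cap bytes into dst. Returns the number of bytes written, or -1
   with errno set (EILSEQ when the conversion stopped at a character). */
long convert_buffer(const char *to, const char *from,
                    const char *src, size_t len, char *dst, size_t cap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)src;   /* iconv() wants char **, the data is not modified */
    size_t inleft = len;
    char *out = dst;
    size_t outleft = cap;

    size_t r = iconv(cd, &in, &inleft, &out, &outleft);
    if (r != (size_t)-1)
        r = iconv(cd, NULL, NULL, &out, &outleft);  /* flush any shift state */

    int saved = errno;        /* iconv_close() may change errno */
    iconv_close(cd);
    if (r == (size_t)-1) {
        errno = saved;
        return -1;
    }
    return (long)(cap - outleft);
}
```

Called with to = "ASCII" and the UTF-8 input "caf\xc3\xa9", the GNU
implementations would be expected to return -1 with errno set to EILSEQ;
with to = "ASCII//TRANSLIT" the same call completes and writes four bytes.
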
Bruno