tech-userlevel: Re: iconv and conversion from/to local charset and wchar

Subject: Re: iconv and conversion from/to local charset and wchar_t
To: Hendrik Sattler <ubq7@stud.uni-karlsruhe.de>
From: Noriyuki Soda <soda@sra.co.jp>
List: tech-userlevel
Date: 01/31/2004 00:04:35

Hi,

>>>>> On Thu, 29 Jan 2004 18:53:03 +0100,
	Hendrik Sattler <ubq7@stud.uni-karlsruhe.de> said:

> I am currently programming with GNU libiconv on Debian GNU/Linux. I was told 
> that the NetBSD iconv implementation does some things a bit different.

> Mainly, I am interested in some details about easy converting
> characters from  local input to an UCS-4 encoded string.
> To do this, GNU libiconv suggests (for iconv_open()) using "" (empty string) 
> or "char" for conversion from/to charset as defined by the current locale 
> (char*), and "wchar_t" for conversion from/to wchar_t*.
> Another method for the encoding of the string inside a char* that was read 
> from the system may be to use nl_langinfo(CODESET).

> What is the suggested method for NetBSD's iconv implementation? 

> Currently, I do a runtime check like
> iconv_open("","");
> iconv_open("char","char");
> iconv_open(nl_langinfo(CODESET),nl_langinfo(CODESET));
> and look at the return value _not_ being (iconv_t)-1.
> Will this work with NetBSD's iconv implementation?

I don't think you need the runtime check, because nl_langinfo(CODESET)
should work on all systems, including Linux, NetBSD and the
commercial UNIX variants.
The "" and "char" names are GNU-specific extensions to iconv_open(3),
and they aren't really essential, because nl_langinfo(CODESET)
does the same thing in a portable way.

So, iconv_open("UCS-4", nl_langinfo(CODESET)) should work on NetBSD.
(Note that iconv_open(3) takes the target encoding first and the
source encoding second.)
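
A minimal sketch of that single-step conversion follows. The function
name locale_to_ucs4() and its buffer-based interface are made up for
illustration. It assumes the caller has already run
setlocale(LC_CTYPE, "") so that nl_langinfo(CODESET) reflects the user's
locale, and it assumes the iconv() prototype takes "char **" for the
input pointer (some systems declare "const char **" instead).

```c
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert `inlen` bytes of `in`, encoded in the
 * current locale's charset, to UCS-4 in `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure.
 * The byte order of "UCS-4" is whatever the iconv implementation uses.
 */
size_t locale_to_ucs4(const char *in, size_t inlen, char *out, size_t outcap)
{
    /* Target encoding first, source encoding second. */
    iconv_t cd = iconv_open("UCS-4", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* Assumption: iconv() wants a non-const input pointer here. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;      /* invalid input or output buffer too small */
    return outcap - outleft;    /* bytes actually produced */
}
```

A caller would pass an output buffer of at least four bytes per expected
character and treat (size_t)-1 as "conversion not available or input not
valid in this locale".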

BTW, Solaris 8 and Solaris 9 only support UTF-7/8/16 for direct
conversion from/to UCS-4.
So you have to do the following to make your program work
on Solaris 8 and Solaris 9:
	1. use iconv_open("UTF-8", nl_langinfo(CODESET)) to convert
	  the locale-dependent string to UTF-8.
	  Of course, this step can be omitted if nl_langinfo(CODESET)
	  returns "UTF-8".
	then
	2. use iconv_open("UCS-4", "UTF-8") to convert UTF-8 to
	  UCS-4 with machine-dependent endianness.
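
The two steps above can be sketched roughly as follows. The helper names
convert() and locale_to_ucs4_via_utf8() are hypothetical, the
intermediate buffer size (four bytes per input byte) is a guess that
should cover common charsets, and the iconv() prototype is again assumed
to take "char **" for the input pointer.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: one-shot conversion between two named encodings.
 * Returns the number of bytes written, or (size_t)-1 on failure. */
static size_t convert(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;     /* assumption: non-const input pointer */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}

/* Locale charset -> UTF-8 -> UCS-4, skipping step 1 for UTF-8 locales. */
size_t locale_to_ucs4_via_utf8(const char *in, size_t inlen,
                               char *out, size_t outcap)
{
    const char *codeset = nl_langinfo(CODESET);

    /* Step 1 can be omitted when the locale is already UTF-8. */
    if (strcmp(codeset, "UTF-8") == 0)
        return convert("UCS-4", "UTF-8", in, inlen, out, outcap);

    /* Step 1: locale charset -> UTF-8 (buffer size is an assumption). */
    size_t tmpcap = inlen * 4 + 4;
    char *tmp = malloc(tmpcap);
    if (tmp == NULL)
        return (size_t)-1;

    size_t n = convert("UTF-8", codeset, in, inlen, tmp, tmpcap);

    /* Step 2: UTF-8 -> UCS-4. */
    if (n != (size_t)-1)
        n = convert("UCS-4", "UTF-8", tmp, n, out, outcap);

    free(tmp);
    return n;
}
```

Going through UTF-8 costs one extra pass over the data, but it only
relies on conversions that the systems mentioned above are said to
provide.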

> Note that I cannot really use wchar_t as I need to do assumptions
> about the encoding and the numeric values of specific characters.

Yeah, the use of wchar_t in scmxx is a problematic point. It's better
to use something like "typedef int unichar_t" to hold a Unicode
character, instead of wchar_t.
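
As a small illustration of why a dedicated type helps: once the value is
a plain Unicode code point, it can be compared against known numeric
values directly. The sketch below assumes the UCS-4 data is stored
big-endian (for example from an encoding name such as "UCS-4BE", where
the iconv implementation offers one); ucs4be_get() and is_euro_sign()
are hypothetical names.

```c
typedef int unichar_t;   /* holds one Unicode code point */

/* Assemble one code point from four big-endian bytes (assumed layout).
 * Unsigned arithmetic avoids shifting into the sign bit of an int. */
unichar_t ucs4be_get(const unsigned char *p)
{
    unsigned long v = ((unsigned long)p[0] << 24) |
                      ((unsigned long)p[1] << 16) |
                      ((unsigned long)p[2] << 8)  |
                       (unsigned long)p[3];
    return (unichar_t)v;
}

/* With a known encoding, numeric comparisons are well defined. */
int is_euro_sign(unichar_t c)
{
    return c == 0x20AC;   /* U+20AC EURO SIGN */
}
```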

Thanks for your interest in NetBSD.
--
soda