tech-misc archive


Re: wchar_t encoding?



Paul Koning wrote:-

> Gents,
> 
> I'm working on a patch to gdb 7.1 to make it work on NetBSD.  The issue
> is that GDB 7 uses iconv to handle character strings, and uses wide
> chars internally so it can handle various non-ASCII scripts.
> 
> The trouble for NetBSD is that it asks iconv to translate to a character
> set named "wchar_t".  That means "whatever the encoding is for the
> wchar_t data type".  GNU libiconv supports that, so on platforms that
> use that library things are fine.
> 
> NetBSD supports iconv, but it doesn't know the "wchar_t" encoding name.
> So I proposed a patch that substitutes what appears to be used instead,
> namely UCS-4 in platform native byte order (so "ucs-4le" on x86, for
> example).  This seems to work.
> 
> The trouble is that I'm getting pushback on the patch, because of
> concerns that the encoding used for wchar_t is not actually UCS-4.  In
> particular, there is this article:
> http://www.gnu.org/software/libunistring/manual/libunistring.html#The-wchar_005ft-mess
> which says that on Solaris and FreeBSD the encoding of
> wchar_t is "undocumented and locale dependent".  (Ye gods!)
> 
> Now, NetBSD is not FreeBSD... so... what is the answer for NetBSD?  Is
> it like FreeBSD?  (If so, it would be good to fix that.)  Or is it a
> fixed encoding, and if so, is it indeed ucs-4?
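
For reference, the substitution described above can be sketched roughly
as follows. This is only an illustration, not the actual gdb patch: the
helper names native_ucs4_name() and open_wide_converter() are made up
here, and the exact charset spellings an iconv implementation accepts
("wchar_t", "UCS-4LE", "UCS-4BE") are assumptions that vary between
libraries. The fallback is only correct where wchar_t really holds UCS-4
values in the current locale.

```c
#include <iconv.h>
#include <stdint.h>

/* Hypothetical helper: pick the UCS-4 charset name that matches the
 * host byte order, detected at run time. */
const char *native_ucs4_name(void)
{
    const uint16_t probe = 1;
    /* On a little-endian host the low-order byte is stored first. */
    return (*(const unsigned char *)&probe == 1) ? "UCS-4LE" : "UCS-4BE";
}

/* Hypothetical helper: ask iconv for the "wchar_t" pseudo-charset first,
 * and fall back to native-endian UCS-4 when that name is unknown.
 * Returns (iconv_t)-1 if neither conversion is available. */
iconv_t open_wide_converter(const char *from_charset)
{
    iconv_t cd = iconv_open("wchar_t", from_charset);
    if (cd == (iconv_t)-1)
        cd = iconv_open(native_ucs4_name(), from_charset);
    return cd;
}
```

A caller would check the result against (iconv_t)-1 and release the
descriptor with iconv_close() when done.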

NetBSD uses Citrus.  From what I've figured out, there are 2 wchar_t
encodings: UCS-4 and one I'll call "kuten".  The latter is a natural
wide character encoding for some of the narrow character encodings
of far eastern character sets.  The following page touches on kuten
a bit:

  http://en.wikipedia.org/wiki/JIS_X_0208

This decision not to use UCS-4 universally for wchar_t is one that
raises much heat, and unfortunately leaves those of us requiring
a single wchar_t encoding somewhat stuck.

Because of this, if you want to convert to ucs-4, you need an extra
kuten<->ucs4 converter and step.  I don't believe the ability to
do this is given by C, or POSIX, or even Citrus.  It's a sad
situation -- there are legitimate reasons to need to be able to do
this; it is untrue that "you should not care".  Consider the case
I had: a compiler front end needing to handle extended identifier
characters, and characters with UCNs in them, and wanting to ensure
that the same identifier written both ways was treated identically.
You need to be able to convert your identifiers to UCS-4, and there
is no portable way to do so.
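
The one hook standard C does give is the C99 macro __STDC_ISO_10646__,
which an implementation defines when wchar_t values are ISO 10646 code
points.  A minimal sketch of guarding a conversion on it follows; the
function names wchar_is_iso10646() and wcs_to_ucs4() are made up for
illustration.  Where the macro is absent the code makes no attempt at a
conversion, since that would need a kuten<->UCS-4 table.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: 1 if the implementation promises that wchar_t
 * values are ISO 10646 (UCS) code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: copy n wide characters into dst as UCS-4 code
 * points.  Returns 0 on success, or -1 when the implementation gives no
 * ISO 10646 guarantee, in which case dst is left untouched. */
int wcs_to_ucs4(const wchar_t *src, size_t n, uint32_t *dst)
{
    size_t i;

    if (!wchar_is_iso10646())
        return -1;
    for (i = 0; i < n; i++)
        dst[i] = (uint32_t)src[i];
    return 0;
}
```

This only tells you when the easy case applies; it does nothing for a
locale whose wchar_t values are kuten-style.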

I do believe that these 2 wchar_t encodings are the only ones you'll
meet.  But you'll need to have a kuten<->UCS4 map somewhere.

Neil.

