RE: wchar_t encoding?

To: "Martin Husemann" <martin%duskware.de@localhost>
Subject: RE: wchar_t encoding?
From: "Paul Koning" <Paul_Koning%Dell.com@localhost>
Date: Mon, 24 May 2010 10:01:53 -0400

> I'm sorry, I must be missing something, but it still is not clear to
> me why you can't use the standard conversion on the host (or, why you
> can assume that host and target have the same wchar_t representation).
> 
> I would expect the side that runs the UI to tell the debugger about
> the character encoding, all relevant parts of the communication to
> happen in a locale-dependent encoding according to that setting,
> and the debugger to just use standard calls to parse those strings.
> 
> If there is no common locale setting for both parties, how can you
> assume to be able to communicate in wchar_t streams? Passing wchar_t
> streams between machines doesn't seem like a good idea, but I guess
> that is where I'm missing something.

As you said, you definitely do NOT want to assume that host and target
have the same representation, or for that matter the same locale.  They
may well be different operating systems, or have different byte order,
and so on.

For that reason, gdb does a conversion when it reads string data from
target memory.  It comes from target memory as a byte string, and it
needs to convert that into something the host can use.

Standard calls like mbtowc don't work for this because they convert
between the multibyte and wide-character encodings of a single locale,
the host's current one.  On the other hand, iconv is specifically
designed to convert between encodings chosen explicitly -- not
implicitly by the host type and locale.
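
A minimal sketch of the explicit-encoding style of conversion, for
illustration only.  The helper name convert_explicit is hypothetical,
and the codeset names an iconv implementation accepts vary; the usage
note assumes "KOI8-R" and "UTF-8" are both available.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes from encoding `from` to
   encoding `to`.  Both encodings are named by the caller; the host
   locale plays no part.  Returns the number of bytes written to out,
   or (size_t)-1 if a name is not supported or the conversion fails. */
size_t convert_explicit(const char *to, const char *from,
                        const char *in, size_t inlen,
                        char *out, size_t outlen)
{
    assert(to != NULL && from != NULL && in != NULL && out != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* Some iconv declarations take char **, others const char **;
       the cast keeps this sketch compiling with the former. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outlen;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    return outlen - outleft;
}
```

For example, convert_explicit("UTF-8", "KOI8-R", buf, len, out, cap)
converts KOI8-R bytes regardless of what LC_CTYPE is set to on the
host, which is the property mbtowc lacks.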

So gdb lets the user specify what string encodings are used on the
target (separately for "char" and "wchar_t" target types).  Gdb then
uses iconv to translate from that encoding to the one used internally on
the host.  The internal encoding is not explicitly chosen by the user;
instead gdb supplies "wchar_t" which in libiconv is the name of "the
encoding of the wchar_t type on this host".  For example, I might be
debugging an ACMEos target that uses KOI-8 for char and UCS-2 Big Endian
for its wchar_t; I would then set those two encodings as the target side
encodings for those two string types, and gdb would do the right thing
(display strings correctly).  
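
The wide-string half of that example can be sketched roughly as
follows.  This is not gdb's actual code; target_wide_to_host is a
hypothetical helper, and it assumes the iconv in use accepts "wchar_t"
as a codeset name (which, as discussed below, not every implementation
does).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert nbytes of raw target memory, encoded
   as target_enc, into host wchar_t values.  Returns the number of
   wchar_t written, or (size_t)-1 if iconv rejects a codeset name or
   the data. */
size_t target_wide_to_host(const char *target_enc,
                           const unsigned char *bytes, size_t nbytes,
                           wchar_t *out, size_t outcap)
{
    assert(target_enc != NULL && bytes != NULL && out != NULL);

    /* "wchar_t" stands for "the wchar_t encoding of this host";
       iconv_open fails here if the implementation lacks that name. */
    iconv_t cd = iconv_open("wchar_t", target_enc);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)bytes;
    char *outp = (char *)out;
    size_t inleft = nbytes;
    size_t outleft = outcap * sizeof(wchar_t);

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Whatever space was consumed is whole wchar_t units. */
    return outcap - outleft / sizeof(wchar_t);
}
```

With the ACMEos example, the caller would pass "UCS-2BE" as target_enc
for wchar_t data read from the target; the byte order of the host never
enters into it.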

That codeset name "wchar_t" may be a libiconv extension; in any case,
the Citrus iconv on NetBSD does not support that encoding name.  The
result is that gdb 7 on NetBSD will not display string variables at all;
instead you get an error message.  So I'm looking for the name to use
instead of "wchar_t".  From what I've learned, "ucs-4" (more precisely,
"ucs-4be" or "ucs-4le" depending on the host byte order) is the right
answer most of the time but apparently not all the time.  So I'm
inclined to use that on the grounds that it makes the situation much
better for NetBSD, and if someone else can dig up the rest of the answer
that can still be done as a later improvement.
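
A rough sketch of that fallback, under the assumption stated above that
the host wchar_t holds UCS-4 in host byte order (which is not true
everywhere).  The function names are hypothetical and this is not the
actual gdb change.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Name of the UCS-4 variant that matches the host byte order. */
const char *host_ucs4_name(void)
{
    const unsigned int probe = 1;
    /* On a little-endian host the low-order byte is stored first. */
    return *(const unsigned char *)&probe ? "UCS-4LE" : "UCS-4BE";
}

/* Hypothetical helper: pick the codeset name to hand to iconv_open
   for the host's wide characters. */
const char *host_wide_codeset(void)
{
    /* Prefer "wchar_t" where the iconv implementation knows it. */
    iconv_t cd = iconv_open("wchar_t", "UTF-8");
    if (cd != (iconv_t)-1) {
        iconv_close(cd);
        return "wchar_t";
    }

    /* Assumption: wchar_t is 4 bytes wide and holds UCS-4.  This is
       the "right most of the time" case, not a guarantee. */
    assert(sizeof(wchar_t) == 4);
    return host_ucs4_name();
}
```

Probing with iconv_open keeps the existing behavior on hosts where
"wchar_t" works and only changes the name where it would otherwise
fail.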

        paul
