tech-misc archive


RE: wchar_t encoding?



> On Mon, May 24, 2010 at 10:01:53AM -0400, Paul Koning wrote:
> > For that reason, gdb does a conversion when it reads string data from
> > target memory.  It comes from target memory as a byte string, and it
> > needs to convert that into something the host can use.
> 
> Ok, I nearly understand what you are trying to explain, but I
> don't understand why any conversion is needed at all. If you have
> a char * variable on the target and the host wants to display the
> content - it can only do that reasonably when knowing target's current
> LC_CTYPE and applying "a compatible LC_CTYPE" on the host. Why is a
> wchar_t string different?

I don't think it is.  I probably didn't explain it well.

Say I'm running gdb on a host that uses UTF-8 char strings, so the
locale on the host side is set accordingly.  But the target doesn't use
Unicode; it's using some national character set, like KOI or Latin-6 or
whatever.  So we need to translate the bytes read from the target in
order to come up with host-side bytes that look right when given to
printf.
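
To make that concrete, here is a minimal sketch of the translation step
in Python (not gdb's actual code).  The function name
target_string_to_host and the sample bytes are made up for illustration,
and KOI8-R stands in for whatever national set the target uses.

```python
def target_string_to_host(raw: bytes, target_charset: str,
                          host_charset: str = "utf-8") -> bytes:
    """Re-encode a NUL-terminated target byte string for the host."""
    # Keep only the bytes before the first NUL terminator, if any.
    payload = raw.split(b"\x00", 1)[0]
    # Interpret the bytes in the target's charset, then re-encode them
    # in the charset the host side expects.
    return payload.decode(target_charset).encode(host_charset)


# Hypothetical target memory: a three-letter Russian word in KOI8-R,
# followed by the NUL terminator.
target_bytes = bytes([0xCD, 0xC9, 0xD2, 0x00])
host_bytes = target_string_to_host(target_bytes, "koi8_r")
print(host_bytes)
```

Handing the original three bytes to a UTF-8 terminal unchanged would
not display the intended text, which is why some conversion is needed.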
 
> >  From what I've learned, "ucs-4" (more precisely,
> > "ucs-4be" or "ucs-4le" depending on the host byte order) is the right
> > answer most of the time but apparently not all the time.
> 
> So you are saying that the target reads the wchar_t * from memory,
> converts it to ucs-4* and transfers the result to the host? Maybe for
> the purpose of debugging this is close enough to be an acceptable
> solution; it should even work (modulo some loss) when the target's
> internal wchar_t representation currently is jis/kuten.

The host-to-target data transfer is always in terms of byte strings.
The host interprets those, based on the data types.  For example, if you
ask for the value of an int, gdb will read however many bytes that is on
the target, and use its knowledge of how the target encodes ints in
order to display the value.  The same applies if you ask for the value
of a string (narrow or wide).
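
As a rough illustration of the int case (a Python sketch, not how gdb
is written), the same four bytes give different values depending on the
byte order the host assumes for the target.  The helper name
decode_target_int and the sample bytes are hypothetical.

```python
def decode_target_int(raw: bytes, byte_order: str,
                      signed: bool = True) -> int:
    """Interpret raw target bytes as an integer.

    byte_order is "big" or "little", i.e. what the host knows about
    how the target lays out its ints.
    """
    return int.from_bytes(raw, byteorder=byte_order, signed=signed)


# Four bytes as they might arrive from a target with a 4-byte int.
raw = bytes([0x00, 0x00, 0x01, 0x02])
as_big = decode_target_int(raw, "big")
as_little = decode_target_int(raw, "little")
print(as_big, as_little)
```

Nothing here depends on locale; the size, byte order and signedness of
the type are all the host needs to know.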

What makes it harder for strings is that their encoding depends on
locale, while the encoding of int does not.

As for jis/kuten, that's what Neil mentioned.  I know next to nothing
about this but from what I read on Wikipedia it appears that the
JIS X 0208 character set is a subset of Unicode.  So I'm puzzled why
jis/kuten would be used as the wchar_t encoding.
 
> But why (besides gdb folks not having it designed this way) couldn't
> the target convert the string to something the host and it agreed
> upon? Or maybe even apply no conversion at all and have the user
> manually set some compatible environment on the host?

The conversion is in fact done on the host.  And wchar_t is used on the
theory that this is the "handles everything" string type.

As for a compatible environment, that assumes there is one; there might
not be.  Iconv is pretty general, but a given OS might not know such a
wide range of locales.  For example, is there a locale in NetBSD that
says you're using ucs-2?  I don't think so.
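
For what it's worth, here is a sketch of host-side decoding of a wide
string when the wchar_t width and byte order of the target are known.
This is Python for brevity, not gdb's implementation; the names
WIDE_CODECS and decode_wide_string are invented, and Python's utf-16
codecs are used as a stand-in for ucs-2, which only matches for text
without surrogate pairs.

```python
# Map (wchar_t size in bytes, target byte order) to a Python codec.
# Assumption: 4-byte wchar_t holds UCS-4 and 2-byte wchar_t holds UCS-2.
WIDE_CODECS = {
    (4, "big"): "utf-32-be",
    (4, "little"): "utf-32-le",
    (2, "big"): "utf-16-be",
    (2, "little"): "utf-16-le",
}


def decode_wide_string(raw: bytes, wchar_size: int, byte_order: str) -> str:
    """Decode a NUL-terminated wide string read from target memory."""
    codec = WIDE_CODECS[(wchar_size, byte_order)]
    text = raw.decode(codec)
    # Drop the wide NUL terminator and anything after it.
    return text.split("\x00", 1)[0]


# "he" with an acute accent on the e, as UCS-4 big-endian and as
# UCS-2 little-endian, each followed by a wide NUL.
ucs4_be = bytes([0, 0, 0, 0x68, 0, 0, 0, 0xE9, 0, 0, 0, 0])
ucs2_le = bytes([0x68, 0, 0xE9, 0, 0, 0])
same = decode_wide_string(ucs4_be, 4, "big") == decode_wide_string(ucs2_le, 2, "little")
print(same)
```

The point is that the host picks the conversion by name, without
needing a locale of its own that matches the target's wchar_t encoding.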

        paul 

