tech-misc archive


RE: wchar_t encoding?



> >> Or do they actually assume it's gonna be utf32?
> >
> > No, that's exactly the issue.
> >
> > The C99 rule you quoted says (or at least implies) that the encoding
> > of wchar_t is locale dependent.  So the question is: how does a
> > program find out WHAT encoding wchar_t uses right now?  I don't see
> > any API for obtaining that information.  Clearly this is necessary --
> > how else can a program construct properly encoded wide char data if
> > it needs to do so (as GDB does)?
> 
> There's an API to convert between plain chars/strings and wide
> chars/strings, and there is a stdio API for wide chars/strings.
> 
> Why is it necessary to know the wide char bit patterns?

Maybe it isn't.  I'm trying to solve the problem GDB needs to solve with
minimal changes to GDB.

What it needs to do: it's given a string (narrow or wide) on a target
system.  It's told (by the user, or defaulted in some suitable way)
what encoding that string uses.  GDB reads that string from memory.
It then wants to do something with it, for example print it.  It also
needs to do some basic processing, for example testing for
non-printable characters.

The current scheme is, in outline:

1. Read the string into a buffer (call it "targetbuf")
2. iconv_open ("wchar_t", target_string_encoding_name)
3. iconv (..., targetbuf, ..., wcharbuf)

and it then has the target string, in wide char format, in the encoding
used by the host (which may be different from that used by the target).

The nice thing about iconv is that it converts between any source and
destination encoding in one step.

I see functions like mbtowc, and it looks like those could be used.
(In fact, that's how libiconv implements iconv().)  But it becomes a
multi-step process: first convert the string read from the target to a
multibyte string, probably with iconv.  I can't find any documentation
that says what the encoding of a multibyte string is, though; libiconv
clearly assumes that it's "Unicode" (meaning UTF-8???).  If it's UTF-8
or some other well-defined encoding, then that works.  The second step
would then be to feed that intermediate encoding to mbtowc, which is
defined to translate to wide chars according to the current locale.

I don't see a narrow string to wc conversion, though there is a narrow
char (single char) to wc conversion.  But that doesn't do the
conversion GDB needs, because it's defined to operate entirely in the
current locale, while the conversion GDB does is from a user-specified
target system encoding on input to the host's locale encoding on
output.  Note also that the target OS may not be NetBSD, it may use a
different byte order, etc...

The two-step conversion, if it does the job, seems acceptable.  If it
gets a whole lot more complicated it becomes hard to swallow, and also
hard for me to justify spending the effort.  After all, another way
out is to say that GDB 7 on NetBSD requires libiconv -- which
eliminates the problem entirely, at the cost of having two libraries
that implement nearly identical versions of iconv: libc and libiconv.

        paul

