Subject: lib/36938: mbtowc misbehaving after invalid char sequence
To: None <,,>
From: None <>
List: netbsd-bugs
Date: 09/06/2007 13:15:00
>Number:         36938
>Category:       lib
>Synopsis:       mbtowc fails converting valid sequences after invalid one
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Sep 06 13:15:00 +0000 2007
>Release:        NetBSD 4.99.23
System: NetBSD 4.99.23 NetBSD 4.99.23 (GENERIC) #0: Sun Jul 15 10:39:38 JST 2007 i386
Architecture: i386
Machine: i386

See the commented example below.  After mbtowc rejects an invalid
sequence, it fails to convert a subsequent valid sequence, even after
the shift state has been reset with mbtowc (&wc, 0, 1).  This is not
limited to UTF-8; it also happens for other encodings, so I believe the
problem is generic, if indeed it is a bug.  If it is not a bug, mbtowc
would seem to be useless in practice.  The code below succeeds on Linux.

#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Valid 2-byte shift-JIS character, not valid UTF-8 sequence.  */
const char sjis[] = "\x95\x5c";   
/* Valid UTF-8, of course.  */
const char space[] = " ";

int main (void)
{
  wchar_t wc;

  setlocale (LC_CTYPE, "ja_JP.UTF-8");

  /* Assert it is not state-dependent.  */
  assert (mbtowc (&wc, 0, 1) == 0);

  /* Assert my charset beliefs.  */
  assert (mbtowc (&wc, space, sizeof space) == 1);
  assert (mbtowc (&wc, sjis, sizeof sjis) == -1);

  /* Unnecessary assertion that we're not state-dependent, but
     just in case some state needs resetting.  */
  assert (mbtowc (&wc, 0, 1) == 0);

  /* This assertion fails - I believe incorrectly.  */
  assert (mbtowc (&wc, space, sizeof space) == 1);

  return 0;
}
	Compile and run above.

 	Around Jul15 2007
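
A possible workaround, offered only as a sketch and not verified on the
affected NetBSD release: use mbrtowc with a caller-owned mbstate_t, and
zero that state whenever a conversion fails, so that the next call starts
from the initial conversion state.  The helper name convert_one is
hypothetical and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one multibyte character from s (at most
   n bytes) using the caller-owned state *st.
   Returns the number of bytes consumed (0 for the terminating null),
   or -1 if the sequence is invalid or incomplete.  On failure the state
   is zeroed, which the C standard defines as the initial conversion
   state, so a later call is not affected by the bad input.  */
int convert_one (wchar_t *wc, const char *s, size_t n, mbstate_t *st)
{
  size_t r;

  assert (st != NULL);

  r = mbrtowc (wc, s, n, st);
  if (r == (size_t) -1 || r == (size_t) -2)
    {
      /* Invalid (-1) or incomplete (-2) sequence: discard any partial
         state instead of relying on the library to recover.  */
      memset (st, 0, sizeof *st);
      return -1;
    }
  return (int) r;
}
```

In the example above, each mbtowc (&wc, s, n) call would become
convert_one (&wc, s, n, &st), with st declared as mbstate_t and cleared
once with memset before the first call.  Whether this avoids the failure
on NetBSD depends on where the stale state is actually kept, which this
report does not establish.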