Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
To: None <lib-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Takehiko NOZAKI <th-nozaki@netwrk.co.jp>
List: netbsd-bugs
Date: 11/13/2007 17:15:05
The following reply was made to PR lib/36938; it has been noted by GNATS.

From: Takehiko NOZAKI <th-nozaki@netwrk.co.jp>
To: gnats-bugs@netbsd.org
Cc: neil@daikokuya.co.uk
Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
Date: Wed, 14 Nov 2007 00:19:13 +0900

 hi, Neil.
 
 >  tnozaki marked this bug closed, but it seems did not understand
 >  the report.
 > 
 
  The current src/lib/libc/citrus/modules/citrus_utf8.c implementation
 (and the other multibyte encoding modules) does this:
 
   219	/* make sure we have the first byte in the buffer */
   220	if (psenc->chlen == 0) {
   221		if (n-- < 1)
   222			goto restart;
   223		psenc->ch[psenc->chlen++] = *s0++;
   224	}
   225
   226	c = _UTF8_count[psenc->ch[0] & 0xff];
   227	if (c < 1 || c < psenc->chlen)
   228		goto ilseq;
 
  - read the first byte into the internal state (line 223).
  - then check whether it is a valid lead byte (lines 226-227).
 
 so the internal state always ends up in a ``non-initial'' state,
 even when that byte is invalid.
 
  OTOH many mbtowc(3) implementations
 (AFAIK glibc2, Solaris, FreeBSD, MSVC++6) seem to:
 
  - check whether the first byte is valid (if it is invalid, return -1).
  - only then store it into the internal state for restart.
 
 so the internal state remains in the ``initial'' state after the error.
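 
 To make the difference between the two orderings concrete, here is a
 rough sketch. Everything in it (struct toy_state, toy_utf8_count,
 store_then_check, check_then_store) is a hypothetical toy written for
 this mail, not the real citrus or glibc code, and it only models the
 handling of the first byte:

```c
#include <stddef.h>

/* Hypothetical toy state, not the real citrus or glibc structure. */
struct toy_state {
	unsigned char ch[6];
	size_t chlen;		/* 0 means "initial" state */
};

/* Toy lead-byte table: expected sequence length, 0 if not a valid lead byte. */
static int
toy_utf8_count(unsigned char b)
{
	if (b < 0x80)
		return 1;
	if (b >= 0xc2 && b <= 0xdf)
		return 2;
	if (b >= 0xe0 && b <= 0xef)
		return 3;
	if (b >= 0xf0 && b <= 0xf4)
		return 4;
	return 0;
}

/* Store the first byte, then validate it. */
static int
store_then_check(struct toy_state *ps, unsigned char b)
{
	if (ps->chlen == 0)
		ps->ch[ps->chlen++] = b;
	if (toy_utf8_count(ps->ch[0]) < 1)
		return -1;	/* the bad byte stays in the state */
	return 0;
}

/* Validate the first byte, then store it. */
static int
check_then_store(struct toy_state *ps, unsigned char b)
{
	if (ps->chlen == 0) {
		if (toy_utf8_count(b) < 1)
			return -1;	/* nothing stored, still initial */
		ps->ch[ps->chlen++] = b;
	}
	return 0;
}
```

 With the first ordering, an invalid byte such as 0xff leaves chlen at 1,
 so a following call keeps failing until the state is reset; with the
 second ordering chlen stays 0 and the next call starts cleanly.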
 
 but ``how to store pieces of a multibyte sequence in the internal state''
 is implementation-defined behavior, because SUSv3's documentation
 doesn't mention it (correct me if i'm wrong).
 
 http://opengroup.org/onlinepubs/007908799/xsh/mbtowc.html
 
 #  in the case of mbrtowc(3) and mbstate_t,
 # "the conversion state is undefined" when the return value is (size_t)-1.
 # 
 # http://opengroup.org/onlinepubs/007908799/xsh/mbrtowc.html
 # http://opengroup.org/onlinepubs/007908799/xsh/wchar.h.html
 
  so, whether the current locale is stateless or stateful,
 you cannot omit the re-initialization of mbtowc(3)'s internal state
 by #if 0'ing it out, i think.
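 
 A minimal sketch of that re-initialization, relying only on the standard
 rule that calling mbtowc(3) with a NULL string pointer puts its internal
 state back to the initial state. mbtowc_or_reset is a hypothetical helper
 name made up for this mail, not something in libc:

```c
#include <stdlib.h>

/*
 * Hypothetical wrapper: convert one multibyte character and, if the
 * conversion fails, reset mbtowc(3)'s internal state so that the next
 * call starts from the initial state on any implementation.
 */
static int
mbtowc_or_reset(wchar_t *pwc, const char *s, size_t n)
{
	int r = mbtowc(pwc, s, n);

	if (r == -1)
		(void)mbtowc(NULL, NULL, 0);	/* back to the initial state */
	return r;
}
```

 Written this way, the caller does not depend on whether the
 implementation stores the invalid byte before or after checking it.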
 
 
 ...but we are in the minority, so we might change this behavior in the future.
 
 very truly yours.
 --
 Takehiko NOZAKI <tnozaki@NetBSD.org>