Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
To: None <lib-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Neil Booth <neil@daikokuya.co.uk>
List: netbsd-bugs
Date: 11/21/2007 13:55:02
The following reply was made to PR lib/36938; it has been noted by GNATS.

From: Neil Booth <neil@daikokuya.co.uk>
To: gnats-bugs@NetBSD.org
Cc: lib-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org
Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
Date: Wed, 21 Nov 2007 22:52:02 +0900

 Takehiko NOZAKI wrote:-
 
 >   current src/lib/libc/citrus/modules/citrus_utf8.c
 >  (and other multibyte encoding modules) implementation:
 >  
 >    219	/* make sure we have the first byte in the buffer */
 >    220	if (psenc->chlen == 0) {
 >    221		if (n-- < 1)
 >    222			goto restart;
 >    223		psenc->ch[psenc->chlen++] = *s0++;
 >    224	}
 >    225
 >    226	c = _UTF8_count[psenc->ch[0] & 0xff];
 >    227	if (c < 1 || c < psenc->chlen)
 >    228		goto ilseq;
 >  
 >   - read the first byte into the internal state (line 223).
 >   - check whether it starts a valid character (lines 226-227).
 >  
 >  so the internal state always ends up in a ``non-initial'' state.
 >  
 >   OTOH many mbtowc(3) implementations
 >  (AFAIK glibc2, Solaris, FreeBSD, MSVC++6) seem to:
 >  
 >   - check whether the first byte is valid (if invalid, return -1).
 >   - store it into the internal state for restart.
 >  
 >  so the internal state remains in the ``initial'' state.
 >  
 >  but ``how to store pieces of a multibyte sequence in the internal
 >  state'' is implementation-defined behavior, because SUSv3's
 >  documentation doesn't mention it (correct me if I'm wrong).
 >  
 >  http://opengroup.org/onlinepubs/007908799/xsh/mbtowc.html
 >  
 >  #  in case of mbrtowc(3) and mbstate_t,
 >  # "the conversion state is undefined" when return value is (size_t)-1.
 >  # 
 >  # http://opengroup.org/onlinepubs/007908799/xsh/mbrtowc.html
 >  # http://opengroup.org/onlinepubs/007908799/xsh/wchar.h.html
 >  
 >   so, whether the current locale is stateless or stateful,
 >  you cannot skip re-initializing the internal state of mbtowc(3)
 >  by #if 0'ing it out, I think.
 >  
 >  
 >  ...but we are in the minority, so we might change the behavior in the future.
 
 Nozaki-san, reading the C standard again I think NetBSD is not
 behaving properly here, in the case of a non-state-dependent
 encoding.  The standard says that calls to mbtowc alter the internal
 state "as necessary" (7.20.7).  However, one of the assertions of
 the code I posted is that UTF-8 is not state-dependent; hence the
 converter should always be in the initial shift state (from the
 language user's point of view; I understand this may not be the case in
 the implementation).  So I believe that, since UTF-8 is not a
 state-dependent encoding, we should be able to call mbtowc at any
 time and expect it to be in the initial shift state.
 
 I would agree that it is not 100% clear though.
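
 To make the disagreement concrete, here is a minimal sketch of the
 call sequence in question.  The helper decode_after_error() and its
 parameters are hypothetical names made up for illustration; they are
 not taken from NetBSD's libc or from the code in the PR.  The only
 library behaviour it relies on is that a call to mbtowc with a null
 s returns the function to its initial conversion state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper (not from NetBSD or the PR): feed mbtowc a possibly
 * invalid sequence, optionally reset its hidden state, then decode a
 * second sequence.  Returns mbtowc's result for the second sequence.
 */
static int
decode_after_error(const char *bad, size_t bad_len,
    const char *good, size_t good_len, wchar_t *out, int reset)
{
	wchar_t scratch;

	assert(bad != NULL && good != NULL && out != NULL);

	/* First call: may return -1 and may leave the hidden state dirty. */
	(void)mbtowc(&scratch, bad, bad_len);

	/* A null s asks mbtowc to return to the initial conversion state. */
	if (reset)
		(void)mbtowc(NULL, NULL, 0);

	/* Second call: with reset == 0 this is the result being debated. */
	return mbtowc(out, good, good_len);
}
```

 With reset set to 1 the second call should decode normally on any
 implementation.  With reset set to 0 under a UTF-8 locale, the result
 shows whether the library stayed in the initial state after the
 failed call, which is the behaviour argued for above, or kept the
 rejected byte around.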
 
 Neil.