NetBSD-Bugs archive


Re: bin/57544: sed(1) and regex(3) problem with encoding



The following reply was made to PR bin/57544; it has been noted by GNATS.

From: tlaronde%polynum.com@localhost
To: gnats-bugs%netbsd.org@localhost
Cc: RVP <rvp%SDF.ORG@localhost>, Martin Husemann <martin%duskware.de@localhost>,
        Taylor R Campbell <campbell+netbsd-tech-userlevel%mumble.net@localhost>
Subject: Re: bin/57544: sed(1) and regex(3) problem with encoding
Date: Mon, 31 Jul 2023 10:52:07 +0200

 RVP has indeed found the culprit, so the diff below:
 
 Index: regcomp.c
 ===================================================================
 RCS file: /pub/NetBSD-CVS/src/lib/libc/regex/regcomp.c,v
 retrieving revision 1.46
 diff -u -r1.46 regcomp.c
 --- regcomp.c	11 Mar 2021 15:00:29 -0000	1.46
 +++ regcomp.c	31 Jul 2023 08:32:56 -0000
 @@ -900,10 +900,10 @@
  	handled = false;
  
  	assert(MORE());		/* caller should have ensured this */
 -	c = GETNEXT();
 +	c = (unsigned char)GETNEXT();
  	if (c == '\\') {
  		(void)REQUIRE(MORE(), REG_EESCAPE);
 -		cc = GETNEXT();
 +		cc = (unsigned char)GETNEXT();
  		c = BACKSL | cc;
  #ifdef REGEX_GNU_EXTENSIONS
  		if (p->gnuext) {
 
 solves the problem.
 
 Explanation: regex(3) keeps each pattern character in an int so that it
 can decorate it with flags, and p_simp_re() was setting to 1 the bit
 immediately to the left of the bits needed for a char:
 
 #       define  BACKSL  (1<<CHAR_BIT)
 
 when the character was part of an escape sequence (a backslash followed
 by the next char). The processing that follows tests for this flag.
 
 On a machine with signed chars and two's complement, where the sign bit
 is "extended" when a char is converted to int, every negative char then
 had this bit set and was treated as an escaped sequence.
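
 To make the sign extension concrete, here is a minimal standalone
 sketch. It is not the libc code: fetch_sign_extended(),
 fetch_as_uchar() and looks_escaped() are hypothetical helpers written
 for this illustration, and only the BACKSL definition is taken from
 regcomp.c. It assumes 8-bit chars and two's complement, and it uses
 signed char explicitly so that it behaves the same on platforms where
 plain char is unsigned.

```c
#include <limits.h>

/* Same definition as in regcomp.c: the first bit above a char. */
#define BACKSL (1 << CHAR_BIT)

/*
 * Hypothetical helper: mimics what an unpatched fetch does on a
 * platform with signed chars. The byte is converted to int with sign
 * extension, so 0xe9 becomes -23 (0x...ffffffe9).
 */
static int
fetch_sign_extended(const char *p)
{
	return (signed char)*p;
}

/*
 * Hypothetical helper: mimics the patched fetch. Going through
 * unsigned char keeps the value in 0..UCHAR_MAX, so no bit above the
 * char is ever set.
 */
static int
fetch_as_uchar(const char *p)
{
	return (unsigned char)*p;
}

/* Hypothetical helper: the kind of test done later on the flag. */
static int
looks_escaped(int c)
{
	return (c & BACKSL) != 0;
}
```

 With these helpers, looks_escaped(fetch_sign_extended("\xe9")) is
 true although no backslash was involved (0xe9 is e-acute in
 ISO8859-15), while looks_escaped(fetch_as_uchar("\xe9")) is false, and
 a plain ASCII char such as 'a' is never flagged in either case.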
 
 From a cursory look, the difference between setting LC_CTYPE=C (no
 problem) and LC_CTYPE=fr_FR.ISO8859-15 (just as an example) is perhaps
 that in the first case extended REs are assumed, while in the latter
 case legacy REs are used, hence not following the same path (legacy
 using p_simp_re() while ERE uses p_ere_exp()).
 
 But the whole code should be reviewed by someone who knows the
 intricacies of locales and ctype, and the signed/unsigned problem
 (compounded by two's complement) also needs a more thorough review.
 
 Ironically, WHATSNEW (dating back to BSD 4.4...) says this:
 
 Most uses of "uchar" are gone; it's all chars now.  Char/uchar
 parameters are now written int/unsigned, to avoid possible portability
 problems with unpromoted parameters.  Some unsigned casts have been
 introduced to minimize portability problems with shifting into sign
 bits.
 
 So signed/unsigned and portability problems are not new...
 -- 
         Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
                     http://kertex.kergis.com/
 Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C
 

