Re: bin/57544: sed(1) and regex(3) problem with encoding

To: tech-userlevel%netbsd.org@localhost
Subject: Re: bin/57544: sed(1) and regex(3) problem with encoding
From: christos%astron.com@localhost (Christos Zoulas)
Date: Wed, 30 Aug 2023 20:38:39 -0000 (UTC)

In article <ZO955+sp0H7eZlwp%polynum.com@localhost>,  <tlaronde%polynum.com@localhost> wrote:
>On Wed, Aug 30, 2023 at 02:32:25PM -0000, Christos Zoulas wrote:
>> In article <a2ba5261-bf4a-f3c7-c614-c54088391e0f%SDF.ORG@localhost>,
>> RVP  <rvp%SDF.ORG@localhost> wrote:
>> >On Wed, 26 Jul 2023, tlaronde%polynum.com@localhost wrote:
>> >
>> >> $ export LC_CTYPE=fr_FR.ISO8859-15
>> >>
>> >> and then:
>> >>
>> >> $ echo "é" | sed 's/é/\&eacute;/g'
>> >> sed: 1: "s/é/\&eacute;/g": RE error: trailing backslash (\)
>> >>
>> >
>> >Not running NetBSD right now, but, FreeBSD 13.2 has the same issue which
>> >can be seen even with a plain grep(1)--as it relies on the libc regexp
>> >engine.
>> >
>> >Can you try the patch below (it is for NetBSD):
>> 
>> Why don't we make next and end unsigned char so that all instances are fixed?
>
>Because one needs to review all the macros and all the invocations of
>the macros, because there are comparisons between next and other
>characters, and comparing unsigned char on one side with signed char on
>the other is sure to open another can of worms.
>
>I think RVP and I are in agreement about this: the whole lib should be
>carefully reviewed. The patch proposed by RVP (the two casts, last patch
>attached to the PR) is safe: it corrects one fault without modifying
>anything else. It perhaps, even probably, does not correct all the
>faults, but at least it does not immediately introduce new ones.
>
>I would have preferred that the library be "eight-bit clean", i.e.
>handling the character set of the C language (ASCII) correctly and
>passing the extra range through as is. Higher-level libraries, if the
>user wants them, would deal with extended character sets and regexes
>by "compiling" them to basic ones that run on the core library, the way
>microcode converts CISC into RISC. The core would be simpler (no
>extended chars), stick to C, and so be easier to make or prove
>correct, with the higher-level library interpreting character classes
>and so on according to the language, the encoding, etc.
>
>This whole "i18n" and "l10n" business is a nightmare, and this is a
>non-native English speaker writing...

It is not that much code to review; I reviewed it and committed the minimal
change. There were 3 places where GETNEXT was promoted and not assigned to
a char.
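
For readers following along, here is a minimal sketch of the kind of
promotion problem being discussed. It is not the actual regcomp.c code:
the struct and the two fetch helpers are hypothetical stand-ins for a
parser that walks a pattern through a plain char pointer. On platforms
where char is signed, a byte such as 0xE9 (e-acute in ISO 8859-15) is
sign-extended when promoted to int unless it is first cast to
unsigned char.

```c
#include <stdio.h>

/* Hypothetical parser state; the names are made up for illustration. */
struct parse_sketch {
	const char *next;	/* next character to read */
	const char *end;	/* one past the last character */
};

/*
 * Fetch through plain char. Where char is signed, bytes >= 0x80
 * are sign-extended and come back as negative ints.
 */
static int
getnext_plain(struct parse_sketch *p)
{
	return *p->next++;
}

/*
 * Fetch with a cast to unsigned char before the promotion to int.
 * The result is always in the range 0..UCHAR_MAX.
 */
static int
getnext_cast(struct parse_sketch *p)
{
	return (unsigned char)*p->next++;
}

/* Print both results for the single byte 0xE9. */
static void
demo(void)
{
	static const char pat[] = "\xe9";
	struct parse_sketch a = { pat, pat + 1 };
	struct parse_sketch b = { pat, pat + 1 };

	/* Platform dependent: negative where char is signed. */
	printf("plain: %d\n", getnext_plain(&a));
	/* Always 233 (0xE9). */
	printf("cast:  %d\n", getnext_cast(&b));
}
```

Calling demo() from main() prints the two values side by side. The
cast variant is the shape of fix that leaves the pointer types alone
and only corrects the value at the point where it is widened.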

Best,

christos

