tech-userlevel archive

Re: bin/57544: sed(1) and regex(3) problem with encoding



On Wed, Aug 30, 2023 at 02:32:25PM -0000, Christos Zoulas wrote:
> In article <a2ba5261-bf4a-f3c7-c614-c54088391e0f%SDF.ORG@localhost>,
> RVP  <rvp%SDF.ORG@localhost> wrote:
> >On Wed, 26 Jul 2023, tlaronde%polynum.com@localhost wrote:
> >
> >> $ export LC_CTYPE=fr_FR.ISO8859-15
> >>
> >> and then:
> >>
> >> $ echo "??" | sed 's/??\&eacute;/g'
> >> sed: 1: "s/??\&eacute;/g": RE error: trailing backslash (\)
> >>
> >
> >Not running NetBSD right now, but, FreeBSD 13.2 has the same issue which
> >can be seen even with a plain grep(1)--as it relies on the libc regexp
> >engine.
> >
> >Can you try the patch below (it is for NetBSD):
> 
> Why don't we make next and end unsigned char so that all instances are fixed?

Because one would need to review all the macros and every invocation of
them: there are comparisons between next and other characters, and
comparing an unsigned char on one side with a signed char on the other
is sure to open another can of worms.
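
To make the concern concrete, here is a minimal sketch of the pitfall.
It is not RVP's patch and not code from the regex library; the helper
names are hypothetical, and it assumes 8-bit bytes. The byte 0xE9 is
e-acute in ISO8859-15.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: compare an unsigned byte with a plain char.
 * Both operands are promoted to int first. Where plain char is signed,
 * the byte 0xE9 on the right becomes -23 while the left stays 233, so
 * the same byte compares unequal. */
int same_byte_mixed(unsigned char a, char b)
{
    return a == b;
}

/* Hypothetical helper: casting the plain char first maps it into
 * 0..UCHAR_MAX, so both sides are promoted from the same range. */
int same_byte_cast(unsigned char a, char b)
{
    return a == (unsigned char)b;
}

/* Exercises both helpers on an ASCII byte and on 0xE9. */
void self_check(void)
{
    unsigned char a = 0xE9;
    char b = (char)0xE9;    /* same bit pattern stored in a plain char */

    /* ASCII is unaffected either way. */
    assert(same_byte_mixed('a', 'a'));
    assert(same_byte_cast('a', 'a'));

    /* With the cast, the comparison holds on every platform. */
    assert(same_byte_cast(a, b));

    /* Without it, the result depends on whether plain char is signed. */
    assert(same_byte_mixed(a, b) == (CHAR_MIN == 0));
}
```

Changing the type of next and end alone would turn every comparison
against a plain char into the "mixed" case, which is why each macro and
each call site would have to be checked.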

I think RVP and I are in agreement about this: the whole library should
be carefully reviewed. The patch proposed by RVP (the two casts, last
patch attached to the PR) is safe: it corrects a fault without modifying
anything else. It perhaps, even probably, does not correct all the
faults, but at least it does not introduce new ones right now.

I would have preferred the library to be "eight-bit clean", i.e.
handling the C language's character set (ASCII) correctly and treating
the extra range as is. Higher-level libraries, if the user wants them,
would deal with extended character sets and regexes and "compile" them
to basic ones running on the core library, the way microcode converts
CISC into RISC. The core would be simpler (no extended chars), would
stick to C, and so would be easier to make or prove correct; the
higher-level library would resolve character classes and so on
according to the language, the encoding, etc.

This whole "i18n" and "l10n" business is a nightmare, and this is a
non-native English speaker writing it...
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                     http://www.kergis.com/
                    http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

