Subject: Re: the state of regex(3)
To: Jason Thorpe <thorpej@shagadelic.org>
From: Alistair Crooks <agc@pkgsrc.org>
List: tech-userlevel
Date: 09/30/2004 21:37:58
On Wed, Sep 29, 2004 at 03:03:04PM -0700, Jason Thorpe wrote:
> 
> On Sep 28, 2004, at 2:17 PM, Alistair Crooks wrote:
> 
> >If Jason could help me out and tell me exactly what the sticking point
> >is, I'd be grateful.
> 
> The sticking point is -- If we replace our regex with PCRE, then we can 
> never pass a POSIX test suite if it happens to test the incompatible 
> feature (which any comprehensive one should).  I think that could be a 
> major issue for some users of the system.

You are assuming that NetBSD could pass a POSIX test suite now.

It can't.

POSIX places certain restrictions on regular expressions.  One is that
the *USER* should keep their expressions to no more than 256 bytes in
length if they are to be portable - see re_format(7).  Whilst NetBSD's
regex code can handle longer expressions, someone else's
POSIX-conformant code may refuse them.  That is one drawback of
standards going for the lowest common denominator.  Are you going to
add a warning message to regcomp(3) for every regexp longer than 256
bytes, just so that POSIX-conformance is assured?
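
To make that rhetorical question concrete, here is a minimal sketch of
what such a warning wrapper might look like.  The names
regcomp_portable(), re_exceeds_portable_length() and PORTABLE_RE_MAX are
hypothetical and exist nowhere in libc; the limit is taken as
re_format(7) words it (REs longer than 256 bytes), which is an
assumption about how the check would be phrased.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Portability limit from re_format(7); the macro name is hypothetical. */
#define PORTABLE_RE_MAX 256

/*
 * Hypothetical helper: return 1 if `pattern` is longer than the
 * portable limit, 0 otherwise.
 */
static int
re_exceeds_portable_length(const char *pattern)
{
	return strlen(pattern) > PORTABLE_RE_MAX;
}

/*
 * Hypothetical wrapper, not part of any libc: warn on stderr when the
 * pattern is over the portable limit, then hand it to regcomp(3)
 * unchanged and return regcomp(3)'s result.
 */
static int
regcomp_portable(regex_t *preg, const char *pattern, int cflags)
{
	if (re_exceeds_portable_length(pattern))
		fprintf(stderr,
		    "warning: regexp is %zu bytes; more than %d is not portable\n",
		    strlen(pattern), PORTABLE_RE_MAX);
	return regcomp(preg, pattern, cflags);
}
```

A caller would use regcomp_portable(&re, pattern, REG_EXTENDED) wherever
it calls regcomp(3) today.  The pattern still compiles on NetBSD either
way; only the warning is new.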

There is also POSIX_MISTAKE - take a look at src/lib/libc/regex/regcomp.c

#ifndef POSIX_MISTAKE
        case ')':               /* happens only if no current unmatched ( */
                /*
                 * You may ask, why the ifndef?  Because I didn't notice
                 * this until slightly too late for 1003.2, and none of the
                 * other 1003.2 regular-expression reviewers noticed it at
                 * all.  So an unmatched ) is legal POSIX, at least until
                 * we can get it fixed.
                 */
                SETERROR(REG_EPAREN);
                break;
#endif

Fine, let's conform to a standard that got it wrong.
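
For anyone who wants to see which way their own libc goes, a small
probe along these lines should do.  It is only a sketch: ere_compiles()
is a hypothetical helper name, and the answer for an unmatched ')'
depends on how the library was built, so no result is claimed for it
here.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: return 1 if the local regcomp(3) accepts
 * `pattern` as an extended RE, 0 if it rejects it.  On rejection the
 * regerror(3) message is printed to stderr.
 */
static int
ere_compiles(const char *pattern)
{
	regex_t re;
	char msg[128];
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
	if (rc != 0) {
		regerror(rc, &re, msg, sizeof(msg));
		fprintf(stderr, "\"%s\": %s\n", pattern, msg);
		return 0;
	}
	regfree(&re);
	return 1;
}
```

Calling ere_compiles("a)b") returns 1 where the library treats the
unmatched ) as an ordinary character, and 0 where it reports an error
such as REG_EPAREN.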

And so back to PCRE - unfortunately, you deleted the section which
showed how to get REG_NEWLINE characteristics.  Besides, these are
Perl-compatible regular expressions, which seem to be much more in
demand than POSIX ones.
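
As a reminder of what "REG_NEWLINE characteristics" means on the POSIX
side, here is a short sketch using plain regcomp(3) and regexec(3).  It
does not reproduce the deleted PCRE section; re_matches() is a
hypothetical helper written only for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if `pattern` matches somewhere in
 * `text` when compiled with `cflags`, 0 if it does not match, and -1
 * if the pattern fails to compile.
 */
static int
re_matches(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With the subject string "a\nb", the pattern "^b" matches only when
REG_NEWLINE is set, because ^ may then match just after a newline.  The
pattern "a.b" behaves the other way round: it matches without
REG_NEWLINE and fails with it, because . stops matching the newline.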

I don't get the POSIX religious thing, especially when it's a flawed
standard.

Regards,
Alistair