Subject: Re: the state of regex(3)
To: Alistair Crooks <agc@pkgsrc.org>
From: Ian Lance Taylor <ian@airs.com>
List: tech-userlevel
Date: 09/30/2004 16:52:23
Alistair Crooks <agc@pkgsrc.org> writes:

> There are certain restrictions on POSIX regular expressions; one is
> that the *USER* should keep their expressions below 256 characters in
> length, to keep them portable - see re_format(7).  Whilst NetBSD's
> regex code can handle longer expressions, someone else's
> POSIX-conformant code may not.  One drawback of standards going for
> the lowest common denominator.  Are you going to add a warning message
> to regcomp(3) for every regexp which is 256 chars or greater, just so
> that POSIX-conformance is assured?

I don't follow this--POSIX doesn't prohibit a POSIX-conformant
implementation from supporting larger regexps; it merely requires that
a POSIX-conformant application avoid using larger regexps.  While one
could add a warning to regcomp to help the cause of writing a
POSIX-conformant application, such a warning should be optional.
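
To make "optional" concrete, here is a rough sketch of an opt-in
check layered on top of regcomp(3).  The names PORTABLE_RE_MAX,
re_exceeds_portable_length and regcomp_checked are made up for
illustration and are not part of any libc; the 256 figure is simply
the one quoted above from re_format(7).

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Assumption: portable patterns stay below 256 characters, per the
 * re_format(7) figure quoted above. */
#define PORTABLE_RE_MAX 256

/* Hypothetical helper: 1 if the pattern is 256 chars or longer. */
static int re_exceeds_portable_length(const char *pattern)
{
    assert(pattern != NULL);
    return strlen(pattern) >= PORTABLE_RE_MAX;
}

/* Hypothetical wrapper: warn only when the caller opts in (warn != 0),
 * then compile exactly as regcomp() would.  The long pattern is still
 * accepted; the warning is purely advisory. */
static int regcomp_checked(regex_t *preg, const char *pattern,
                           int cflags, int warn)
{
    if (warn && re_exceeds_portable_length(pattern))
        fprintf(stderr,
                "warning: regexp of %zu chars may not be portable\n",
                strlen(pattern));
    return regcomp(preg, pattern, cflags);
}
```

The implementation keeps supporting long patterns either way; only an
application that asks for the check hears about it.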

> There is also POSIX_MISTAKE - take a look at src/lib/libc/regex/regcomp.c
> 
> #ifndef POSIX_MISTAKE
>         case ')':               /* happens only if no current unmatched ( */
>                 /*
>                  * You may ask, why the ifndef?  Because I didn't notice
>                  * this until slightly too late for 1003.2, and none of the
>                  * other 1003.2 regular-expression reviewers noticed it at
>                  * all.  So an unmatched ) is legal POSIX, at least until
>                  * we can get it fixed.
>                  */
>                 SETERROR(REG_EPAREN);
>                 break;
> #endif
> 
> Fine, let's conform to a standard that got it wrong.

For what it's worth, the GNU approach is to check the environment
variable POSIXLY_CORRECT, and only strictly adhere to the standard
when that variable is defined.  A somewhat similar case is ISO C
trigraphs--you don't normally want your compiler to implement
trigraphs, which basically mung your strings in weird ways, but
support is required for ISO C conformance; gcc only enables them
when asked, e.g. via the -ansi or -trigraphs option.
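
As a rough sketch of how a POSIXLY_CORRECT-style switch could be
applied to the unmatched ')' case quoted above: the helper names and
the enum below are hypothetical, and this is not how NetBSD's
regcomp.c is actually structured (it uses the compile-time
POSIX_MISTAKE define instead).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: 1 if the named environment variable is defined
 * at all (even if empty), which is how POSIXLY_CORRECT is usually
 * tested. */
static int env_flag_set(const char *name)
{
    assert(name != NULL);
    return getenv(name) != NULL;
}

/* Hypothetical outcome for an unmatched ')' in a pattern. */
enum paren_action { PAREN_LITERAL, PAREN_ERROR };

/* Strict mode follows the letter of 1003.2, where an unmatched ')' is
 * legal and treated as an ordinary character; otherwise report it as
 * an error, as the REG_EPAREN branch in the quoted code does. */
static enum paren_action unmatched_rparen_action(int strict)
{
    return strict ? PAREN_LITERAL : PAREN_ERROR;
}
```

A caller would do something like
unmatched_rparen_action(env_flag_set("POSIXLY_CORRECT")), so the
stricter behaviour is chosen at run time by the user rather than at
build time.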

> I don't get the POSIX religious thing, especially when it's a flawed
> standard.

I think that POSIX conformance, albeit user-controlled, is desirable.
If nothing else, it permits writing highly portable application code.
And it is a selling point for NetBSD.

Ian