Subject: Re: the state of regex(3)
To: Christos Zoulas <christos@zoulas.com>
From: Alistair Crooks <agc@pkgsrc.org>
List: tech-userlevel
Date: 09/28/2004 22:17:44
On Tue, Sep 28, 2004 at 06:40:55PM +0000, Christos Zoulas wrote:
> 4. POSIX conformance: REG_NEWLINE will not follow POSIX, according to the docs.
> 
> So license is fine, code is not our style and not my favorite to maintain,
> but not a real showstopper (although it would be nice if the author was
> convinced to follow a more traditional style). Docs are ok, but the real
> stickler is POSIX conformance, or isn't it?

My reading of the docs is that the default POSIX behaviour (that is,
without REG_NEWLINE) can be had from PCRE, and I know of no way to
switch on the REG_NEWLINE behaviour of the POSIX regex engine from the
command line of egrep(1) or awk(1) (for example).

pcre.txt says on this matter:

       This area is not simple, because POSIX and Perl take different views of
       things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
       then  PCRE was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching  newline  characters  in
       PCRE:
         
                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE

       This is the equivalent table for POSIX:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       PCRE's behaviour is the same as Perl's, except that there is no equiva-
       lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is
       no way to stop newline from matching [^a].

       The   default  POSIX  newline  handling  can  be  obtained  by  setting
       PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE
       behave exactly as for the REG_NEWLINE action.

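So the default POSIX newline handling (no REG_NEWLINE) is possible with
PCRE, by setting PCRE_DOTALL and PCRE_DOLLAR_ENDONLY. Only the
REG_NEWLINE case cannot be reproduced exactly.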

In our whole src tree, I can find the following uses of REG_NEWLINE:

usr.bin/m4/gnum4.c:       REG_NEWLINE | REG_EXTENDED);
usr.bin/nl/nl.c:                  &argstr[1], REG_NEWLINE|REG_NOSUB)) != 0) {
usr.sbin/user/user.c: if (regcomp(&r, line, REG_EXTENDED|REG_NEWLINE) != 0) {

and these could be converted to PCRE fairly easily, I would have said.

If Jason could help me out and tell me exactly what the sticking point
is, I'd be grateful. Is it any worse than defining POSIX_MISTAKE for
libc builds? (And yes, I know what POSIX_MISTAKE is for; I'm talking
about the whole area of POSIX regular expressions.)

Regards,
Alistair