Subject: Re: the state of regex(3)
To: None <tech-userlevel@netbsd.org>
From: Christos Zoulas <christos@zoulas.com>
List: tech-userlevel
Date: 09/28/2004 18:40:55
In article <20040928163821.GA8789@nef.pbox.org>,
Alistair Crooks <agc@pkgsrc.org> wrote:
>On Fri, Jan 02, 2004 at 05:48:18PM -0500, Greg A. Woods wrote:
>> I had forgotten that I had a basic bit of test harness for doing simple
>> regex testing and benchmarking with the egrep implementation by James
>> Howard and Dag-Erling Smørgrav (which is just a wrapper around any POSIX
>> regex library).  Remembering this prompted me to fetch and compile the
>> latest versions of the various libraries mentioned so far and give them
>> each a test run.
>> [...] 
>> For the rest here are some timing results from the following silly test
>> I use to find obvious viruses in e-mail, as run across about 64MB of
>> accumulated virus e-mail.  So far PCRE is the clear winner by a country
>> mile and TRE is way ahead of the rest of the pack.  TRE will probably
>> also improve quite a bit more before there's a 1.x release of it.  TRE
>> has become very much more interesting in the latest release too -- it
>> now has true support for approximate pattern matching using real EREs
>> (i.e. in a manner vastly superior to the old agrep).
>
>With thanks to Greg for his benchmarking, which I've deleted, but is in
>the archive.
>
>Thomas Klausner has just updated the PCRE package to 5.0. It's
>interesting to note that this update says:
>
>	Log Message:
>	Update to 5.0:
>
>	Release 5.0 13-Sep-04
>	---------------------
>
>	The licence under which PCRE is released has been changed to the more
>	conventional "BSD" licence.
>
>	In the code, some bugs have been fixed, and there are also some major changes
>	in this release (which is why I've increased the number to 5.0). Some changes
>	are internal rearrangements, and some provide a number of new facilities.
>
>Assuming that the internal rearrangements have not clobbered the performance
>in any way, is there any reason to stay with the old regex(3) implementation?
>Shouldn't we just move to pcre?

Well,

1. The license is indeed BSD, but formatted differently.
2. The code is indented in a GNUish style with the following differences:
   - code starts at column 0
   - compound statements are sometimes on the same line:
	if (blaf) { foo; }
   - sometimes if/then/else statements are formatted like:
        if (blaf) a = b; else
	  {
	  c = d;
	  }
   - other times the indentation rules are more complex:
      if (a) 
        {
        if (a == b) c = d;
          else if (a == d) f = g;
        else
          {
	  e=h;
          }
        } 
3. The documentation looks ok, but will need some cleanup.
4. POSIX conformance: REG_NEWLINE handling does not follow POSIX, according
   to the docs.

So the license is fine; the code is not our style and not my favorite to
maintain, but that is not a real showstopper (although it would be nice if
the author could be convinced to follow a more traditional style). The docs
are ok, but the real sticking point is POSIX conformance, or isn't it?

christos