Subject: Re: the state of regex(3) (was: Policy questions)
To: NetBSD Userlevel Technical Discussion List <tech-userlevel@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 01/01/2004 03:05:10
[ On Monday, December 29, 2003 at 10:49:08 (-0500), Perry E. Metzger wrote: ]
> Subject: Re: Policy questions
>
> "Bruce J.A. Nourish" <bjan+tech-userlevel@bjan.net> writes:
> >  * The man page to regex(3), circa 1994, claims that regex(3) is "an
> >    alpha release with known defects." As it has been rigorously tested
> >    for almost 10 years, could we remove that comment?
> 
> Well, it is certainly no longer an alpha release. :)
> 
> Are there still known defects?

It's deathly slow.  (compared to many (all? :-) other implementations)

According to Henry speeding it up requires a complete rewrite, one he
was apparently working on for some time, but hasn't come close to
finishing yet.

I've been hunting around off and on for a couple of years now for a
POSIX compatible ERE library to replace NetBSD's regex with, but
unfortunately I haven't found many that are _not_ currently covered by
the GNU copyright licence, and not even many that are ready with POSIX
compatability.

There is Tatu Ylonen's 1991 release of his regexpr library.  It's
covered by a reasonable copyright, but I don't think it's as stable as
Henry's code despite being quite mature.  It implements Emacs REs, as
well as traditional "egrep" EREs, and if I remember correclty it is
quite a bit faster than Henry's code.  I think it would still need quite
a bit of work to be made POSIX compatible, but it might be a good
starting place.

Ozan's public domain regex library is quite fast, but not nearly POSIX
compatible in any way shape or form.  I've talked to him about adding a
POSIX API, and possibly about fleshing out POSIX ERE support too, but he
hasn't found the time, nor have I.

BWK's AWK has a rather fast and somewhat extended RE implementation
too, and it's under a reasonable copyright as well, but it too would
need some work to support a POSIX compatible API (and it's REs are not
quite POSIX compatible either).

Another very fast and stable ERE implementation can be found in mawk.
It would would also need quite a bit of hacking to make it more POSIX
compatible.  I've been meaning to ask Michael Brennan if he would
consider relicensing it under a more BSD-like copyright.

PCRE is very fast and stable too (as well as being very well
documented), but although it includes a POSIX API it only implements
Perl-compatible EREs, not true POSIX EREs.  I seriously doubt Philip
would consider relicensing it under a more BSD-like copyright though,
but I suppose it wouldn't hurt to ask.

Brenton Hoff's Grouse Grep is also amazingly fast (often faster than GNU
Grep, and certainly _much_ smaller!), but I think it would require some
considerable work to extract a POSIX compatible API regex library from
it, and apparently the new version is now GPL'ed too.

I should also mention Andy Key's Extended Regular Expression Handler.
It's fullly public domain and supports a good portion of POSIX ERE
syntax.  However it doesn't have a POSIX API.  Unfortunately I've not
had a chance to really benchmark it yet, though it's apparently
reasonably fast -- just not quite as fast as GNU Regex (which means it
should still be a lot faster than Henry's regex :-).  It should be
easier to add a POSIX API wrapper to it than it would be to any of the
others.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>