tech-userlevel: Re: the state of regex(3)

Subject: Re: the state of regex(3)
To: NetBSD Userlevel Technical Discussion List <tech-userlevel@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 09/28/2004 17:26:29
[ On Tuesday, September 28, 2004 at 17:38:21 (+0100), Alistair Crooks wrote: ]
> Subject: Re: the state of regex(3)
>
> With thanks to Greg for his benchmarking, which I've deleted, but is in
> the archive.

Well just to throw another monkey wrench into the mix I should point out
that since I made that post back in January I've discovered that a
version of Henry Spencer's more recent re-implementation of the regex(3)
library is available, for example in recent releases of TCL.  It has a
very liberal copyright license (despite the fact that some folks,
e.g. even Jeffrey Friedl, have mistakenly assumed he's only released it
for use in TCL to date -- that's simply not true -- the only released
code is embedded in TCL, but it is separately licensed from TCL and it
can be extracted into a separate package by anyone with the programming
skills necessary to do so, and this has already been done by the
PostgreSQL folks, among others).

Henry's new library implements what he calls "_advanced_ REs".  The new
library also supports POSIX Basic REs and POSIX Extended REs.  His
documentation sasy that "POSIX EREs are almost an exact subset of AREs."

Most of the extensions in AREs are based on ideas from Perl5's RE
extensions, though it seems AREs are not quite compatible with full
Perl5 REs, nor with PCRE's implementation of Perl5 REs.  AREs are in
some senses a superset of Perl5 REs, especially the metasyntax.

As for speed, well I've not yet done a careful benchmark of this new
library vs. anything else.  Henry's new library has been described as a
hybrid DFA/NFA engine that dynamically uses the best algorithm for any
given RE.  Henry posted the following note about its capabilities:

    In general, the Tcl RE-matching engine is much less sensitive to the exact     
    form of the RE than traditional matching engines.  Things that it does         
    quickly will be fast no matter how you write them; things that it does         
    slowly will be slow no matter how you write them.  The old folklore about      
    hand-optimizing your REs simply does not apply.                                

(from <URL:http://groups.google.ca/groups?selm=Ft0Lqv.76o%40spsystems.net&output=gplain>)

In his book "Mastering Regular Expressions" Jeffrey Friedl goes so far
as to say "If this technology [Henry's hybrid DFA/NFA engine] were more
widespread, much of this chapter would not be needed." (referring to
chapter 6 "Crafting an Efficient Expression")

One thing I can say about what I've learned of PCRE performance over the
past six months or so is that it's extremely easy to accidentally cause
PCRE do do a lot of backtracking (especially on large strings) and not
always easy or sometimes even possible to rewrite the epxression to stop
or limit backtracking, at least not given my meager understanding of
PCRE optimisations.

Size-wize Henry's new code, including the locale-specific code added by
Scriptics Corp., is only ~9500 lines (PCRE-4.5 is over 13,300 lines),
though the i386 object compares as follows (PCRE has more internal
documentation :-):

$ size libregex.a
text    data    bss     dec     hex     filename
160     0       0       160     a0      regfronts.o (ex libregex.a)
28      0       0       28      1c      regfree.o (ex libregex.a)
7936    0       0       7936    1f00    regexec.o (ex libregex.a)
1056    288     0       1344    540     regerror.o (ex libregex.a)
32264   936     0       33200   81b0    regcomp.o (ex libregex.a)
                        =====
                        42668

$ size libpcre.a libpcreposix.a                                   
text    data    bss     dec     hex     filename
736     0       0       736     2e0     maketables.o (ex libpcre.a)
756     0       0       756     2f4     get.o (ex libpcre.a)
1856    0       0       1856    740     study.o (ex libpcre.a)
29796   1108    0       30904   78b8    pcre.o (ex libpcre.a)
3340    0       0       3340    d0c     pcreposix.o (ex libpcreposix.a)
                        =====
                        37592

I guess I should ask Henry directly whether he's done any work recently
to create a proper separate release or not....

I should also update all of the regex codebases I've benchmarked, add
Henry's latest code to the list, and at least re-run my basic benchmark
again.  (Apparently TRE has improved vastly since my last run too!)

BTW, Doug McIlroy wrote a regex(3) test harness (since updated and now
maintained by Glenn Fowler (AT&T Labs)) (Google for testregex.c +
Fowler) which we should use in our regression tests.  There's a good
sized set of test data for it too.

If I had more time I would try pulling Henry's latest library into a
copy of my netbsd-1-6 tree too.  Doing that may even be much easier to
do than integrating PCRE....

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>