Subject: Re: the state of regex(3)
To: NetBSD Userlevel Technical Discussion List <tech-userlevel@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 01/02/2004 17:48:18
I had forgotten that I had a basic bit of test harness for doing simple
regex testing and benchmarking with the egrep implementation by James
Howard and Dag-Erling Smørgrav (which is just a wrapper around any POSIX
regex library).  Remembering this prompted me to fetch and compile the
latest versions of the various libraries mentioned so far and give them
each a test run.

Here are the sizes of the statically linked binaries built with the
various libraries:

NetBSD i386 w/ GCC 2.95.3 nb3
                  text    data     bss     dec     hex  filename
NetBSD-regex    144646    8688   12856  166190   2892e  grep
pcre-4.5        153982    9776   12824  176582   2b1c6  grep
tre-0.6.4       156046    8944   12824  177814   2b696  grep
rx-1.5          156810   10512   13240  180562   2c152  grep
onig-20031224   200490   18320   12824  231634   388d2  grep
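
For anyone who wants the size differences spelled out, here is a small
sketch that rechecks the table (dec should equal text + data + bss) and
prints how much larger each binary is than the NetBSD-regex one.  The
numbers are copied from the table; the SIZES and overhead names are just
made up for this illustration.

```python
# (text, data, bss, dec) in bytes, copied from the size table in this post.
SIZES = {
    "NetBSD-regex": (144646, 8688, 12856, 166190),
    "pcre-4.5": (153982, 9776, 12824, 176582),
    "tre-0.6.4": (156046, 8944, 12824, 177814),
    "rx-1.5": (156810, 10512, 13240, 180562),
    "onig-20031224": (200490, 18320, 12824, 231634),
}


def overhead(name, baseline="NetBSD-regex"):
    """Bytes by which `name` exceeds the baseline binary (hypothetical helper)."""
    return SIZES[name][3] - SIZES[baseline][3]


for name, (text, data, bss, dec) in SIZES.items():
    # The dec column is the sum of the three segment sizes.
    assert text + data + bss == dec, name
    print(f"{name:14s} dec={dec:7d} hex={dec:x} extra={overhead(name):+d}")
```

This only restates the table; it says nothing about how much of each
difference is library code versus glue.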

Unfortunately I get an immediate core dump from the Oniguruma library,
which looks to be caused by a bug in its POSIX API interface code.

For the rest, here are some timing results from the following silly test
I use to find obvious viruses in e-mail, as run across about 64MB of
accumulated virus e-mail.  So far PCRE is the clear winner by a country
mile (about 9 seconds of real time versus 65 to 193 seconds for the
others), and TRE is well ahead of the rest of the pack.  TRE will
probably also improve quite a bit more before there's a 1.x release of
it.  TRE has also become much more interesting in the latest release --
it now has true support for approximate pattern matching using real EREs
(i.e. in a manner vastly superior to the old agrep).

    /usr/bin/time -l ./grep -D -E -i \
	-e 'The file was successfully deleted by RAV AntiVirus' \
	-e 'I send you this file in order to have your advice' \
	-e '^TV[nopqr][A-Z]...[AB]..A.A....*AAAA...*AAAA' \
	-e '^M35[GHIJK].`..`..*````' \
	-e '^[	 ]*content-(disposition|type).*name[	 ]*=[	 ]*"?(.*\.(386|acm|ade|adp|app|asp|awx|ax|bas|bat|bin|cdf|chm|class|cmd|cnv|com|cpl|crt|csh|dll|dlo|doc|dot|drv|exe|flt|fot|hlp|hta|ini|inf|ins|isp|js|jse|lnk|mdb|mde|mod|msc|msi|msp|mst|nws|obj|ocx|olb|osd|ovl|pcd|pdr|pgm|pif|pkg|pot|ppt|pps|prg|reg|rpl|rtf|scr|script|sct|sh|sha|shtml|shs|swf|sys|tlb|tsp|ttf|vb|vlm|vxd|vxo|wiz|wll|wwk|pdr|url|vb|vbe|vbs|wsc|wsf|wsh|xla|xlb|xlc|xld|xlk|xll|xlm|xls|xlt|xlv|xlw|xnk))"?[	 ]*$' \
	/mfbd/woods/virii > test.out
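
To show roughly what the attachment-name pattern accepts, here is a
minimal sketch using Python's re module.  It is only an illustration:
Python's engine is a backtracking matcher rather than a POSIX regex(3)
implementation, the extension list is cut down to four of the entries
from the command, and line_matches plus the sample header lines are
invented for this example.

```python
import re

# Abbreviated stand-in for the long extension list in the command;
# only four of the original extensions are kept for readability.
EXTENSIONS = "exe|pif|scr|vbs"

# The patterns are alternatives, as with repeated -e options to grep.
PATTERNS = [
    r"The file was successfully deleted by RAV AntiVirus",
    r"I send you this file in order to have your advice",
    r'^[\t ]*content-(disposition|type).*name[\t ]*=[\t ]*"?(.*\.('
    + EXTENSIONS
    + r'))"?[\t ]*$',
]

# re.IGNORECASE plays the role of grep's -i flag.
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def line_matches(line):
    """Return True if any pattern matches the line (hypothetical helper)."""
    return any(rx.search(line) for rx in COMPILED)


# Invented sample lines, not taken from the real test corpus.
SAMPLES = [
    'Content-Disposition: attachment; filename="readme.pif"',
    "Content-Type: application/octet-stream; name=photo.scr",
    'Content-Type: text/plain; name="notes.txt"',
]

for sample in SAMPLES:
    print(line_matches(sample), sample)
```

The point is just that the header pattern is anchored at both ends and
keys on the file extension, so a line has to be a complete
content-disposition or content-type header naming a listed extension.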


NetBSD-regex:
      192.77 real       191.70 user         0.04 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      2049  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         1  block output operations
         2  messages sent
         0  messages received
         0  signals received
         2  voluntary context switches
      2641  involuntary context switches

pcre-4.5:
        9.21 real         8.84 user         0.03 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      2046  page reclaims
         3  page faults
         0  swaps
         0  block input operations
         0  block output operations
        11  messages sent
         0  messages received
         0  signals received
        16  voluntary context switches
       145  involuntary context switches

tre-0.6.4:
       65.11 real        64.30 user         0.13 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      2344  page reclaims
         4  page faults
         0  swaps
         0  block input operations
         0  block output operations
        12  messages sent
         0  messages received
         0  signals received
        16  voluntary context switches
       942  involuntary context switches

rx-1.5:
      140.42 real       139.26 user         0.11 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      3401  page reclaims
         4  page faults
         0  swaps
         0  block input operations
         0  block output operations
        12  messages sent
         0  messages received
         0  signals received
        16  voluntary context switches
      1994  involuntary context switches


FYI, those tests were run on a system with a 700MHz PIII CPU and 1GB of RAM.
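
To put the real-time figures side by side, here is a rough sketch that
computes each library's run time relative to the fastest one and an
approximate scan rate.  The times are the "real" values from the time(1)
output in this post; the 64 is the "about 64MB" corpus size, so the
rates are only approximate, and the helper names are made up.

```python
# Elapsed ("real") seconds copied from the time(1) output in this post.
REAL_SECONDS = {
    "NetBSD-regex": 192.77,
    "pcre-4.5": 9.21,
    "tre-0.6.4": 65.11,
    "rx-1.5": 140.42,
}

# Assumption: the corpus is treated as exactly 64MB, though the post
# only says "about 64MB".
CORPUS_MB = 64


def slowdown(name, fastest="pcre-4.5"):
    """Run time of `name` as a multiple of the fastest run (hypothetical helper)."""
    return REAL_SECONDS[name] / REAL_SECONDS[fastest]


def throughput(name):
    """Approximate MB scanned per second of real time (hypothetical helper)."""
    return CORPUS_MB / REAL_SECONDS[name]


# List the libraries from fastest to slowest.
for name in sorted(REAL_SECONDS, key=REAL_SECONDS.get):
    print(f"{name:14s} {slowdown(name):6.2f}x  {throughput(name):5.2f} MB/s")
```

These are single runs on one machine with one pattern set, so the ratios
should be read as a rough indication rather than a general ranking.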

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>