Subject: regexec(3) is broken or I am
To: netbsd-help <netbsd-help@netbsd.org>
From: None <jklowden@schemamania.org>
List: netbsd-help
Date: 11/28/2005 18:41:42
I thought I understood regexec(3), but either I don't or it doesn't.
Would any of the assembled gurus like to debug my code?  Please? 

AIUI, regcomp(3) returns the number of parenthesized subexpressions
in regex_t::re_nsub.  Then regexec fills the regmatch_t array with
offset pairs.  The first array element describes the overall match,
and the subsequent elements describe the subexpressions, in order
of appearance by left parenthesis.
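
Put as code, my understanding is roughly the sketch below.  The helper
names find_offsets and print_offsets are made up for illustration and
are not part of the attached program; the flags are the ones I use
there.  It assumes rm_so is the offset of the first matched character
and rm_eo the offset just past the last one, both counted from zero.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: compile `pattern`, search `input`, and store up to
 * `max` offset pairs in `out`.  Returns the number of pairs filled in
 * (1 + re_nsub, capped at max), or -1 if the pattern does not compile
 * or does not match.
 */
int
find_offsets(const char *pattern, const char *input,
    regmatch_t *out, size_t max)
{
	regex_t re;
	size_t n;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
		return -1;

	/* Element 0 is the whole match, elements 1..re_nsub the groups. */
	n = re.re_nsub + 1;
	if (n > max)
		n = max;

	if (regexec(&re, input, n, out, 0) != 0) {
		regfree(&re);
		return -1;
	}

	regfree(&re);
	return (int)n;
}

/* Hypothetical helper: print each offset pair and the text it covers. */
void
print_offsets(const char *input, const regmatch_t *m, int n)
{
	for (int i = 0; i < n; i++) {
		if (m[i].rm_so < 0) {
			/* Group did not take part in the match. */
			printf("RE %d: (unused)\n", i);
			continue;
		}
		printf("RE %d: (%d-%d)\t%.*s\n", i,
		    (int)m[i].rm_so, (int)m[i].rm_eo,
		    (int)(m[i].rm_eo - m[i].rm_so), input + m[i].rm_so);
	}
}
```

On that reading, searching "exec sp_test1" for (test) should report
8-12 for both element 0 and element 1, and (ex)(ec) should report 0-4,
0-2 and 2-4.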

I find the following anomalies:

1.  rm_so and rm_eo are zero-based if no parentheses are present,
else are 1-based.

2.  Standard character class names aren't recognized, even with
REG_EXTENDED.

3.  Offsets are often wrong. 

4.  Multiple parenthesized subexpressions don't match.  

5.  If the RE matches more than once, the last match should be 
returned, but isn't. 

I can't believe the library is that broken.  I didn't find a bug
report.  Google turned up several uses of regexec in /usr/src,
but none (that I noticed) that use multiple subexpressions.
I'm sure it's my test program, but for the life of me I don't see
what's wrong.  The attached simple program and its output illustrate
the problems.

I am using 2.0_BETA:
$ uname -a |sed 's/autobuild.*$//'
NetBSD hello.acml.com 2.0_BETA NetBSD 2.0_BETA (GENERIC) #0: Thu Sep 23 02:37:20 UTC 2004  

Am I writing the wrong program, or misunderstanding the documentation?  

Many thanks for your kind attention.

--jkl

=-= snip =-=
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <libgen.h>

#include <assert.h>
#include <locale.h>
#include <regex.h>

int
main(int argc, char *argv[]) 
{
	int i;
	regex_t re;
	regmatch_t matches[20];
	int erc;	/* regcomp(3) and regexec(3) return int */
	char msg[256];
	
	char *name_pattern = getenv("RE");
	
	assert(argc > 1);
	const char *argument = argv[1];
	
	printf("pattern:\t%s\n", name_pattern);
	
	erc = regcomp(&re, name_pattern, REG_EXTENDED|REG_ICASE);
	
	if (erc) {
		regerror(erc, &re, msg, sizeof(msg));
		printf("error %d: rpc.c:%d: %s\n", erc, __LINE__, msg);
		exit(1);
	}
	
	printf("pattern OK:\tre_nsub = %zu\n", re.re_nsub);	/* re_nsub is a size_t */
	
	printf("input:  \t%s\n", argument);
	
	memset(matches, '\0', sizeof(matches));
	
	erc = regexec(&re, name_pattern, 1 + re.re_nsub, matches, 0);
	
	if (erc) {
		regerror(erc, &re, msg, sizeof(msg));
		printf("error %d: rpc.c:%d: %s\n", erc, __LINE__, msg);
		exit(1);
	}
	
	for (i=0; i <= (int)re.re_nsub; i++) {
		int c;
		
		assert(matches[i].rm_so >= 0 );
		printf("RE %d: (%d-%d)\t", i, (int)matches[i].rm_so, (int)matches[i].rm_eo);

		for (c=0; c < matches[i].rm_eo - 1; c++) {
			if (c < matches[i].rm_so - 1) {
				putchar(' ');
			} else {
				putchar(argument[c]);
			}
		}
		printf("\n");
	}

	regfree(&re);
	
	exit(0);
}
=-= /snip =-=

=-= output =-=
$ for RE in `cat re`; do  ./argparse "exec sp_test1"; echo - - -; done
## wrong offset, should be 8 or 9.
pattern:        test
pattern OK:     re_nsub = 0
input:          exec sp_test1
RE 0: (0-4)     exe
- - -
pattern:        (test)
pattern OK:     re_nsub = 1
input:          exec sp_test1
RE 0: (1-5)     exec
RE 1: (1-5)     exec
- - -
## ^ is no good?  
pattern:        exec
pattern OK:     re_nsub = 0
input:          exec sp_test1
RE 0: (0-4)     exe
- - -
pattern:        ^exec
pattern OK:     re_nsub = 0
input:          exec sp_test1
error 1: rpc.c:45: regexec() failed to match
- - -
pattern:        (exec)
pattern OK:     re_nsub = 1
input:          exec sp_test1
RE 0: (1-5)     exec
RE 1: (1-5)     exec
- - -
## multiple subexpressions
pattern:        (ex(ec))
pattern OK:     re_nsub = 2
input:          exec sp_test1
error 1: rpc.c:45: regexec() failed to match
- - -
pattern:        (ex)(ec)
pattern OK:     re_nsub = 2
input:          exec sp_test1
error 1: rpc.c:45: regexec() failed to match
- - -
pattern:        ([[:alnum:]])
pattern OK:     re_nsub = 1
input:          exec sp_test1
RE 0: (4-5)        c
RE 1: (4-5)        c
- - -
## Standard character class names don't work?
## (expect to match 'test1')
pattern:        ([[:alnum:]]+)
pattern OK:     re_nsub = 1
input:          exec sp_test1
RE 0: (4-9)        c sp_
RE 1: (4-9)        c sp_
- - -
## (expect to match 'sp_test1')
pattern:        ([_[:alnum:]]+)
pattern OK:     re_nsub = 1
input:          exec sp_test1
RE 0: (2-3)      x
RE 1: (2-3)      x
- - -
## (expect to match 'sp_test1')
pattern:        ([_[a-z0-9]+)
pattern OK:     re_nsub = 1
input:          exec sp_test1
RE 0: (1-5)     exec
RE 1: (1-5)     exec
=-= /output =-=