Subject: regexec(3) is broken or I am
To: netbsd-help <netbsd-help@netbsd.org>
From: None <jklowden@schemamania.org>
List: netbsd-help
Date: 11/28/2005 18:41:42
I thought I understood regexec(3), but either I don't or it doesn't.
Would any of the assembled gurus like to debug my code? Please?
AIUI, regcomp(3) returns the number of parenthesized subexpressions
in regex_t::re_nsub. Then regexec fills the regmatch_t array with
offset pairs. The first array element describes the overall match,
and the subsequent elements describe the subexpressions, in order
of appearance by left parenthesis.
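To make that reading concrete, here is a minimal sketch of how I expect the two calls to fit together. The helper names find_groups and print_groups are hypothetical, made up for illustration. The sketch assumes the offsets are zero-based and half-open, with rm_so == -1 marking a subexpression that took no part in the match.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern`, match it against `input`, and
 * store up to `max` offset pairs in `out`.  Returns the number of pairs
 * filled in (1 + re_nsub, capped at max), or -1 on a compile error or
 * when nothing matches. */
static int
find_groups(const char *pattern, const char *input,
            regmatch_t *out, size_t max)
{
    regex_t re;
    size_t n;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Element 0 is the overall match; 1..re_nsub are the subexpressions. */
    n = re.re_nsub + 1;
    if (n > max)
        n = max;

    rc = regexec(&re, input, n, out, 0);
    regfree(&re);
    return rc == 0 ? (int)n : -1;
}

/* Hypothetical helper: print each offset pair and the text it covers. */
static void
print_groups(const char *input, const regmatch_t *m, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (m[i].rm_so == -1) {
            printf("RE %d: (no match for this subexpression)\n", i);
            continue;
        }
        printf("RE %d: (%d-%d)\t%.*s\n", i,
            (int)m[i].rm_so, (int)m[i].rm_eo,
            (int)(m[i].rm_eo - m[i].rm_so), input + m[i].rm_so);
    }
}
```

Under those assumptions, calling find_groups("(ex)(ec)", "exec sp_test1", m, 4) should fill three pairs: one for the whole match and one for each parenthesized group.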
I find the following anomalies:
1. rm_so and rm_eo are zero-based if no parentheses are present,
and 1-based otherwise.
2. Standard character class names aren't recognized, even with
REG_EXTENDED.
3. Offsets are often wrong.
4. Multiple parenthesized subexpressions don't match.
5. If the RE matches more than once, the last match should be
returned, but isn't.
I can't believe the library is that broken. I didn't find a bug
report. Google turned up several uses of regexec in /usr/src,
but none (that I noticed) that use multiple subexpressions.
I'm sure it's my test program, but for the life of me I don't see
what's wrong. The attached simple program and output illustrate
the problems.
I am using 2.0_BETA:
$ uname -a |sed 's/autobuild.*$//'
NetBSD hello.acml.com 2.0_BETA NetBSD 2.0_BETA (GENERIC) #0: Thu Sep 23 02:37:20 UTC 2004
Am I writing the wrong program, or misunderstanding the documentation?
Many thanks for your kind attention.
--jkl
=-= snip =-=
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <libgen.h>
#include <assert.h>
#include <locale.h>
#include <regex.h>
int
main(int argc, char *argv[])
{
    int i;
    regex_t re;
    regmatch_t matches[20];
    size_t erc;
    char msg[256];
    char *name_pattern = getenv("RE");

    assert(argc > 1);
    const char *argument = argv[1];

    printf("pattern:\t%s\n", name_pattern);
    erc = regcomp(&re, name_pattern, REG_EXTENDED|REG_ICASE);
    if (erc) {
        regerror(erc, &re, msg, sizeof(msg));
        printf("error %d: rpc.c:%d: %s\n", erc, __LINE__, msg);
        exit(1);
    }
    printf("pattern OK:\tre_nsub = %u\n", re.re_nsub);
    printf("input: \t%s\n", argument);

    memset(matches, '\0', sizeof(matches));
    erc = regexec(&re, name_pattern, 1 + re.re_nsub, matches, 0);
    if (erc) {
        regerror(erc, &re, msg, sizeof(msg));
        printf("error %d: rpc.c:%d: %s\n", erc, __LINE__, msg);
        exit(1);
    }

    for (i=0; i <= re.re_nsub; i++) {
        int c;
        assert(matches[i].rm_so >= 0 );
        printf("RE %d: (%d-%d)\t", i, (int)matches[i].rm_so, (int)matches[i].rm_eo);
        for (c=0; c < matches[i].rm_eo - 1; c++) {
            if (c < matches[i].rm_so - 1) {
                putchar(' ');
            } else {
                putchar(argument[c]);
            }
        }
        printf("\n");
    }

    regfree(&re);
    exit(0);
}
=-= /snip =-=
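For comparison, here is a sketch of the same match-and-print logic written under two assumptions that differ from the program above: that rm_so and rm_eo are always zero-based with rm_eo pointing one past the last matched character, and that the string handed to regexec is the input argument rather than the pattern. The names render_match and show_matches are hypothetical. This is one possible variant to test against, not a claim about what the library requires.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `input` into `buf`, replacing every character
 * before the match with a space and dropping everything after the match.
 * Assumes zero-based, half-open offsets [rm_so, rm_eo). */
static void
render_match(const char *input, const regmatch_t *m, char *buf, size_t buflen)
{
    size_t c, out = 0;

    if (buflen == 0)
        return;
    if (m->rm_so >= 0) {
        for (c = 0; c < (size_t)m->rm_eo && out + 1 < buflen; c++)
            buf[out++] = (c < (size_t)m->rm_so) ? ' ' : input[c];
    }
    buf[out] = '\0';
}

/* Hypothetical driver: compile `pattern`, match it against `input` (not
 * against the pattern itself), and print one line per offset pair.
 * Returns 0 on a match, otherwise the regcomp/regexec error code. */
static int
show_matches(const char *pattern, const char *input)
{
    regex_t re;
    regmatch_t matches[20];
    char line[256];
    size_t i, n;
    int erc;

    erc = regcomp(&re, pattern, REG_EXTENDED | REG_ICASE);
    if (erc != 0)
        return erc;

    n = re.re_nsub + 1;
    if (n > sizeof(matches) / sizeof(matches[0]))
        n = sizeof(matches) / sizeof(matches[0]);

    erc = regexec(&re, input, n, matches, 0);
    if (erc == 0) {
        for (i = 0; i < n; i++) {
            render_match(input, &matches[i], line, sizeof(line));
            printf("RE %zu: (%d-%d)\t%s\n", i,
                (int)matches[i].rm_so, (int)matches[i].rm_eo, line);
        }
    }
    regfree(&re);
    return erc;
}
```

A call such as show_matches(getenv("RE"), argv[1]) would stand in for the body of main above, with error reporting through regerror left out to keep the sketch short.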
=-= output =-=
$ for RE in `cat re`; do ./argparse "exec sp_test1"; echo - - -; done
## wrong offset, should be 8 or 9.
pattern: test
pattern OK: re_nsub = 0
input: exec sp_test1
RE 0: (0-4) exe
- - -
pattern: (test)
pattern OK: re_nsub = 1
input: exec sp_test1
RE 0: (1-5) exec
RE 1: (1-5) exec
- - -
## ^ is no good?
pattern: exec
pattern OK: re_nsub = 0
input: exec sp_test1
RE 0: (0-4) exe
- - -
pattern: ^exec
pattern OK: re_nsub = 0
input: exec sp_test1
error 1: rpc.c:45: regexec() failed to match
- - -
pattern: (exec)
pattern OK: re_nsub = 1
input: exec sp_test1
RE 0: (1-5) exec
RE 1: (1-5) exec
- - -
## multiple subexpressions
pattern: (ex(ec))
pattern OK: re_nsub = 2
input: exec sp_test1
error 1: rpc.c:45: regexec() failed to match
- - -
pattern: (ex)(ec)
pattern OK: re_nsub = 2
input: exec sp_test1
error 1: rpc.c:45: regexec() failed to match
- - -
pattern: ([[:alnum:]])
pattern OK: re_nsub = 1
input: exec sp_test1
RE 0: (4-5) c
RE 1: (4-5) c
- - -
## Standard character class names don't work?
## (expect to match 'test1')
pattern: ([[:alnum:]]+)
pattern OK: re_nsub = 1
input: exec sp_test1
RE 0: (4-9) c sp_
RE 1: (4-9) c sp_
- - -
## (expect to match 'sp_test1')
pattern: ([_[:alnum:]]+)
pattern OK: re_nsub = 1
input: exec sp_test1
RE 0: (2-3) x
RE 1: (2-3) x
- - -
## (expect to match 'sp_test1')
pattern: ([_[a-z0-9]+)
pattern OK: re_nsub = 1
input: exec sp_test1
RE 0: (1-5) exec
RE 1: (1-5) exec
=-= /output =-=