NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

misc/40387: Serious bug in regular expression library (regex) affected sed



>Number:         40387
>Category:       misc
>Synopsis:       Serious bug in regular expression library (regex) affected sed
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    misc-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jan 13 14:10:02 +0000 2009
>Originator:     Chris Kuklewicz
>Release:        4.0 and 4.99.59
>Organization:
Not Applicable
>Environment:
I have no output to paste here
>Description:
I have tracked the source of a bug originally seen on OS X back to the same bug 
in FreeBSD and NetBSD.

The bug is in the libc implementation of regular expressions, used via the API 
in "regex.h" specifically regexec.

Programs which use this library are affected, including "sed" which I will use 
to illustrate the bug below.

The bug effects both the whole text matched and the choice of text reported as 
matching parenthesized subexpressions.  The bug violates the leftmost longest 
matching rule for POSIX regular expressions.  The bug violates the principle 
that the order of subexpressions around the "|" operator should not affect the 
whole text matched.

A terminal sessions using three sed commands with extended regular expressions:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
a[b]
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[ab]
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 
's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
[XababaYababaZ] => [X][ab][aba][Y][b][Z]

The first two commands differ only in the order of "()" and "." around the "|" 
operator.  According to the POSIX specification this must not affect the whole 
match, which is report by "&" in sed's replacement specification.  The leftmost 
longest match is "ab" and is correctly reported by the second command.  The 
first command incorrect does not match starting with the "a" and matches just 
the "b".

This establishes that the bug violates the POSIX specification and can affect a 
program that only cares about the whole text matched (ignoring captured 
parenthesized subexpressions).

The third command shows an internal inconsistency in the output.  The 
(aba|b|ab)* operator should capture the text matched by the last repetition of 
the (aba|b|ab) subexpression and report that in \6. In particular \6 should 
match "aba" just like \3 does.

As shown above the FreeBSD output is [XababaYababaZ] => [X][ab][aba][Y][b][Z]
The correct output should have been  [XababaYababaZ] => [X][ab][aba][Y][abc][Z]

Could there be a way to explain the FreeBSD output? Consider the following 
questions:

If the last repetition of (aba|b|ab)* matched \6 to "b" then how is the text 
between that "b" skipped so that "(Z)" matched \7 against the final "Z" in the 
text?
If the "b" reported by \6 is the last "b" in the text then there is no part of 
the pattern that can match the next "a".
If the "b" reported by \6 is the next-to-last "b" in the text then there is no 
part of the pattern that can match the preceding "a".
Checking the indices returned by regex shows that \6 was matched against the 
last "b" in the text.
Thus reporting \6 as matching "b" is contradictory, there is no consistent 
interpretation of what regexec returned.

This establishes that the bug is can violate not only the POSIX specification 
but also violate any rational and consistent alternate specification.

I found the first example by using Haskell's QuickCheck to randomly generate 
test cases.  The second example was found by using Glenn Fowler's "AT&T Labs 
Research regex(3) regression tests" from 
http://www.research.att.com/~gsf/testregex/
I "tested" this bug on NetBSB by asking on the #netbsd channel of 
irc.freenode.net and ASau and sjamaan answered.
I tested FreeBSD 6.3-PRERELEASE thanks to the free shell account at 
m-net.arbornet.org and FreeBSD 8.0-CURRENT was tested by "methi" on the 
#freebsd channel at irc.freenode.net

>How-To-Repeat:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)

If there is no bug then the first two both output "[ab]" and the last output 
"[XababaYababaZ] => [X][ab][aba][Y][abc][Z]".

Currently the bug causes the first the output "a[b]" and the last to output 
"[XababaYababaZ] => [X][ab][aba][Y][b][Z]".

>Fix:
I do not have the time to be able to fix the regex.h part of libc.



Home | Main Index | Thread Index | Old Index