NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

bin/46155: awk: NetBSD-current record splitting is broken



>Number:         46155
>Category:       bin
>Synopsis:       awk: NetBSD-current record splitting is broken
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Mar 08 22:10:01 +0000 2012
>Originator:     Miguel Piñeiro Jr
>Release:        NetBSD 5.1 but affected code is from -current
>Organization:
>Environment:
NetBSD 5.1 NetBSD 5.1 (GENERIC) #0: Sun Nov 7 14:39:56 UTC 2010  
builds%b6.netbsd.org@localhost:/home/builds/ab/netbsd-5-1-RELEASE/i386/201011061943Z-obj/home/builds/ab/netbsd-5-1-RELEASE/src/sys/arch/i386/compile/GENERIC
 i386

To be a bit more specific about the environment, NetBSD 5.1 is a guest in 
VMWare Fusion 2.0.2 on an Intel Mac running OS X 10.4.11.
>Description:
Hello, NetBSD community:

NetBSD-current's AWK is fundamentally broken and has been since the attempt to 
add support for RS regular expressions. As it stands, it cannot be trusted to 
generate correct records for all valid values of RS. If you're lucky, it'll 
error out. If you're not, it will silently do the wrong thing.

Note that not only is the regular expression case itself incorrectly 
implemented, but the once-functioning multiline-record handling (empty RS) is 
now also broken. Only the case of a single character RS -- no more, no less -- 
works correctly.

The readrec() changes in the following diff to lib.c are to blame:

http://cvsweb.netbsd.org/bsdweb.cgi/src/external/historical/nawk/dist/lib.c.diff?r1=1.1&r2=1.2&only_with_tag=MAIN

When I stumbled across this issue, while curiously surveying the code of 
several AWK implementations, I hadn't used NetBSD in over 10 years. Safe to say 
that I'm a newbie again. Following what seemed to me the most expedient path, I 
installed the latest release iso and checked out -current's nawk. This is 
almost certainly not a supported configuration, but I doubt it has affected my 
analysis.

My testing was conducted using the code in cvs HEAD of 
src/external/historical/nawk, compiled from that directory with `make 
MKINFO=no`.

Line numbers herein refer to 
http://cvsweb.netbsd.org/bsdweb.cgi/src/external/historical/nawk/dist/lib.c?annotate=1.4&only_with_tag=MAIN




CASE 1:  RS IS EMPTY (multiline records)
readrec(): lines 205-210


When RS is empty, readrec() will do nothing but skip leading newlines.  Once 
those newlines have been read (if indeed they were present), subsequent calls 
to readrec() result in a practical no-op, getc->ungetc.  nawk will be stuck in 
place, never reaching the end of the stream, never terminating, unless it 
implodes.

Also, when RS is empty, the automatic variable rr is never initialized.  It is 
declared on line 192 and then used in an adjbuf() call on line 268.

Within adjbuf(), the result of 1+rr-buf is eventually passed to realloc.  In 
Test 1.2 below, this led to realloc() failure.  Should realloc() not lead to 
abnormal termination of the getc-ungetc stalemate (as in Test 1.4), 
dereferencing and assigning to rr may.

$ # Test 1.1
$ # Correct behavior with default RS
$ echo foo | ./nbawk-current 1
foo
$ # Test 1.2
$ # Implosion with null RS
$ echo foo | ./nbawk-current 1 RS=

./nbawk-current: out of memory in readrec 3
 input record number 1, file 
 source line number 1
$ # Test 1.3
$ # Correct behavior with default RS
$ echo foo | ./nbawk-current 'END {print "END"}'
END
$ # Test  1.4
$ # Should give a result identical to Test 1.3, but
$ # with null RS, perpetually increased cpu utilization warms the room.
$ echo foo | ./nbawk-current 'END {print "END"}' RS=
^C
$ # End of transcript




CASE 2:  RS IS A SINGLE CHARACTER
readrec(), lines 246-267


The common, single-character RS case performs as expected, but harbors 
unreachable code.

It's impossible to reach this statement block if **RS and sep are unequal, 
since that inequality can only arise within a mutually exclusive block (line 
206). Therefore, the test at line 256 always succeeds; the break at line 257 is 
unconditional; and lines 258-265 are unreachable.

This unreachable code was a critical component of multiline record handling. 
sep and **RS are always equal except when RS is empty, in which case sep is set 
to newline. This equality was used to distinguish single and multiline modes 
when their handling was intertwined.




CASE 3:  RS IS MULTICHARACTER (regular expression)
readrec(), lines 211-245.


There are a lot of mistakes in this section.

If adjbuf (line 220) relocates buf, it adjusts rr, but rrr and brr are left 
pointing to the buffer's previous location. The next call to nematch() would be 
looking in the wrong place for a string to match.  And, any attempt to break 
the stream-reading while-loop before the EOF would clobber rr, the pointer to 
the current character location in the buffer, with the invalid location in rrr 
(228-232). This would make it impossible to properly null-terminate the record 
(line 270), and would result in more memory management errors via adjbuf() 
(line 269).

Further, all of the code within the block following the while-loop is 
unreachable (lines 240-244).  There are only two conditions which exit the 
while loop and they are OR'd with each other on line 238.  That test cannot 
fail, hence the break (line 239) is unconditional.  Fortunately, this code 
doesn't belong here anyway; it is only required when handling multiline records 
(consecutive newlines delimiting a record).

These bugs notwithstanding, the entire approach's logic is deeply flawed.

For example, when nematch() finds a match and sets patbeg, line 235, brr = rrr 
= rr = patbeg, will lead to the next iteration's *rr++ = c, line 224, 
clobbering the first character of the matching pattern. This makes no sense 
whatsoever. 

Imagine what's happening when trying to match the regular expression /ab?/ 
against the text "1ab2". The "1" is read from the stream and null terminated. 
nematch then fails to match /ab?/ against "1\0".  The "a" is then read in and 
nematch succeeds in matching /ab?/ against "1a\0". patbeg, rr, and brr all now 
point to "a". The next iteration reads "b" into rr, clobbering "a", resulting 
in buf containing "1b\0".  nematch() is then passed brr which now points to 
"b\0".  Notice that nematch() never sees the original data's sequence of "ab", 
hence it is incapable of successfully matching "ab" against /ab?/. nematch() 
then correctly fails to match "1b" against /ab?/.  This failure following an 
earlier success triggers the ungetc'ing of "b" (which will show up as part of 
the second record instead of part of a record separator) and breaks the reading 
while-loop.

The critical problem with that algorithm is that nematch() never sees the 
original data's sequence of "ab", hence it never even has the opportunity to do 
the right thing when working with /ab?/; the data contains "1ab2" yet nematch() 
only sees "1", "1a", and "1b".  That is obviously incorrect.

$ # Test 3.1
$ # Correct behavior with typical, single-char RS
$ printf 1a2aa3aaa4 | ./nbawk-current 1 RS=a
1
2

3


4
$ # Test 3.2
$ # Erroneous behavior with a regular expression
$ # The following's output should be identical to RS=a
$ # There should be empty records.  Why is it matching multiple characters?
$ # Because each successive "a" is overwriting the previous "a" when patbeg
$ # is stored in rr.
$ printf 1a2aa3aaa4 | ./nbawk-current 1 RS='[a]'
1
2
3
4
$ # Test 3.3
$ # Again, erroneous behavior with a regular expression
$ # The following's output should be identical to Test 3.1.
$ # There should be empty records.  Why is it matching multiple characters?
$ # For the same reason as Test 3.2.
$ printf 1a2ab3aba4 | ./nbawk-current 1 RS='[ab]'
1
2
3
4
$ # Test 3.4
$ # Correct result because there is only one successful nematch.
$ # The number which overwrites the "a" is ungetc'd.
$ # Lucky break.
$ printf 1ab2ab3ab4 | ./nbawk-current 1 RS='ab'
1
2
3
4
$ # Test 3.5
$ # Correct behavior when attempting a greedy match
$ # A slightly different type of lucky break.
$ printf 1a2aa3aaa4 | ./nbawk-current 1 RS='aa*'
1
2
3
4
$ # Test 3.6
$ # Erroneous behavior when attempting a greedy match
$ # The following's output should be identical to the preceding example.
$ # Note that the second "a" in "aa" clobbers the first in the buffer, but
$ # since it matches the regular expression, it is not returned to the
$ # stream like each instance of "b", which winds up as part of the
$ # record when it should have been part of the separator.
$ printf 1a2ab3aab4 | ./nbawk-current 1 RS='aa*b*'
1
2
b3
b4
$ # End of transcript

There are more shortcomings than the simple cases shown above.  For example, 
any correct approach must be capable of backing up multiple characters, not 
just one.  For example, "abcdefh" ~ /a(bcdefz)*/ would require consuming 
through "h" to learn that the correct match is just "a".

In my opinion, regular expression support for RS cannot be correctly 
implemented using any of the matching functions in b.c.

match() finds only the shortest match, which is unacceptable for the case of 
determining the record separator.

pmatch() attempts a longest match but will accept an empty match in lieu of 
failure, which is also unacceptable.

nematch() will not accept empty matches and will return the earliest, longest 
possible match. It seems suitable upon casual inspection, but it's not.

To properly split data into records, we need to know when the automaton is 
incapable of finding a longer match. The possibility of a longer match exists 
so long as the stream has not reached the EOF and the current state contains 
following leaves. The automaton must be fed until one of these conditions is no 
longer true.

The way that the buffer-oriented nematch() works, we cannot tell when hope 
should be abandoned.  nematch() will return a true result if it finds a match 
before reaching the end of the buffer, an outcome characterized by the absence 
of further state transitions. This is what we're looking for.  However, 
nematch() will also return true if the buffer's end forces it to settle for a 
shorter match even though the end of the stream has not been reached and there 
are possible state transitions available.  Since the caller cannot distinguish 
the two -- the latter requires further data from the stream, the former does 
not -- nematch() in its present incarnation is inadequate.

If there was some way to determine the state of the automaton from outside 
nematch() -- perhaps by extending nematch()'s interface or by modifying the 
finite automaton's definition to retain state -- then nematch() could be used, 
although clumsily, to achieve a correct result.  However, I believe that an 
interface that directly interacts with the stream would be simpler.  The diffs 
that I've included below are my attempt at such an approach.


IMPLEMENTATION INCOMPATIBILITIES

There's at least two situations in which gawk and mawk disagree on how to 
handle a regular expression RS.  I've included two diffs, one for each behavior.

gawk: gawk-3.1.8nb1       GNU awk
mawk: mawk-1.3.4.20100625 AWK clone by Mike Brennan
nbawk-current: NetBSD-current's broken AWK implementation.
nbawk-fixed-mawk: NetBSD-current's AWK implementation with my mawk-behavior 
diff applied.
nbawk-fixed-gawk: NetBSD-current's AWK implementation with my gawk-behavior 
diff applied.

Within the scope of the tested scenarios, my diffs do not introduce any new 
area of incompatiblity. Disagreement with gawk or mawk only occurs when it is 
inevitable because gawk and mawk already disagree with each other. And in those 
cases, there is agreement with one of the implementations, a third alternative 
result is not introduced.

In the test results below, "SUCCESS" indicates that all AWK implementations 
yield identical output, "FAILURE" that they do not.  Failures are followed by 
each implementation's stdout and stderr.

First, let's take a look at gawk versus NetBSD's current awk implementation. 
It's not pretty. Incorrect results abound and there are a few instances of 
memory allocation errors.

SUCCESS: 000: printf 1a2aa3aaa4 | "$awk" 1 RS=a         # Single character RS
FAILURE: 001: printf 1a2aa3aaa4 | "$awk" 1 RS='[a]'             # Regular 
expression RS
gawk:
--> 1
--> 2
--> 
--> 3
--> 
--> 
--> 4
./nbawk-current:
--> 1
--> 2
--> 3
--> 4
FAILURE: 002: printf 1a2ab3aba4 | "$awk" 1 RS='[ab]'
gawk:
--> 1
--> 2
--> 
--> 3
--> 
--> 
--> 4
./nbawk-current:
--> 1
--> 2
--> 3
--> 4
SUCCESS: 003: printf 1ab2ab3ab4 | "$awk" 1 RS=ab
FAILURE: 004: printf 1a2aa3aaa4 | "$awk" 1 RS='a*'
gawk:
--> 
--> 
--> 
--> 
--> 
--> 
./nbawk-current:
--> 1
--> 2
--> 3
--> 4
SUCCESS: 005: printf 1a2aa3aaa4 | "$awk" 1 RS='aa*'
FAILURE: 006: printf 1a2ab3aab4 | "$awk" 1 RS='aa*b*'
gawk:
--> 1
--> 2
--> 3
--> 4
./nbawk-current:
--> 1
--> 2
--> b3
--> b4
SUCCESS: 007: printf 1abc | "$awk" 1 RS='abcde|b'
SUCCESS: 008: printf 1abcdf2 | "$awk" 1 RS='abcde|b'
FAILURE: 009: printf 1abcdebf2 | "$awk" 1 RS='abcde|b'
gawk:
--> 1
--> 
--> f2
./nbawk-current:
--> 1a
--> cde
--> f2
SUCCESS: 010: printf 1abcdf2 | "$awk" 1 RS='abcde|a'
SUCCESS: 011: printf a1a2a3a | "$awk" 1 RS='^a'         # Anchors
FAILURE: 012: printf aaa1a2a | "$awk" 1 RS='^a'
gawk:
--> 
--> aa1a2a
./nbawk-current:
--> 
--> 
--> 
--> 1a2a
FAILURE: 013: printf a1a2a3a | "$awk" 1 RS='a$'
gawk:
--> a1a2a3
./nbawk-current:
--> 
--> 1
--> 2
--> 3
FAILURE: 014: printf a1a2aaa | "$awk" 1 RS='a$'
gawk:
--> a1a2aa
./nbawk-current:
--> 
--> 1
--> 2
FAILURE: 015: jot -s a 10000 | "$awk" 'NR>1' RS='999[0-9]'      # Trigger 
adjbuf relocation
gawk:
--> a
--> a
--> a
--> a
--> a
--> a
--> a
--> a
--> a
--> a10000
--> 
./nbawk-current:
FAILURE: 016: printf foo | "$awk" 1 RS=                 # Empty RS (multiline 
records)
gawk:
--> foo
./nbawk-current:
--> 
--> 
--> 
--> 
--> ./nbawk-current: out of memory in readrec 3
-->  input record number 4, file 
-->  source line number 1
FAILURE: 017: printf '\n\n\nr1f1\nr1f2\n\nr2f1\nr2f2\n\n\n' | "$awk" '{$1=$1}1' 
RS= OFS=:
gawk:
--> r1f1:r1f2
--> r2f1:r2f2
./nbawk-current:
--> 
--> 
--> 
--> 
--> ./nbawk-current: out of memory in readrec 3
-->  input record number 4, file 
-->  source line number 1




And now, a comparison of gawk, nbawk-fixed-gawk, mawk, nbawk-fixed-mawk.
Of the 18 scenarios tested, disagreement arises in two cases, 004 and 012.


SUCCESS: 000: printf 1a2aa3aaa4 | "$awk" 1 RS=a         # Single character RS
SUCCESS: 001: printf 1a2aa3aaa4 | "$awk" 1 RS='[a]'             # Regular 
expression RS
SUCCESS: 002: printf 1a2ab3aba4 | "$awk" 1 RS='[ab]'
SUCCESS: 003: printf 1ab2ab3ab4 | "$awk" 1 RS=ab
FAILURE: 004: printf 1a2aa3aaa4 | "$awk" 1 RS='a*'
gawk:
--> 
--> 
--> 
--> 
--> 
--> 
./nbawk-fixed-gawk:
--> 1
--> 2
--> 3
--> 4
mawk:
--> 1
--> 2
--> 3
--> 4
./nbawk-fixed-mawk:
--> 1
--> 2
--> 3
--> 4
SUCCESS: 005: printf 1a2aa3aaa4 | "$awk" 1 RS='aa*'
SUCCESS: 006: printf 1a2ab3aab4 | "$awk" 1 RS='aa*b*'
SUCCESS: 007: printf 1abc | "$awk" 1 RS='abcde|b'
SUCCESS: 008: printf 1abcdf2 | "$awk" 1 RS='abcde|b'
SUCCESS: 009: printf 1abcdebf2 | "$awk" 1 RS='abcde|b'
SUCCESS: 010: printf 1abcdf2 | "$awk" 1 RS='abcde|a'
SUCCESS: 011: printf a1a2a3a | "$awk" 1 RS='^a'         # Anchors
FAILURE: 012: printf aaa1a2a | "$awk" 1 RS='^a'
gawk:
--> 
--> aa1a2a
./nbawk-fixed-gawk:
--> 
--> aa1a2a
mawk:
--> 
--> 
--> 
--> 1a2a
./nbawk-fixed-mawk:
--> 
--> 
--> 
--> 1a2a
SUCCESS: 013: printf a1a2a3a | "$awk" 1 RS='a$'
SUCCESS: 014: printf a1a2aaa | "$awk" 1 RS='a$'
SUCCESS: 015: jot -s a 10000 | "$awk" 'NR>1' RS='999[0-9]'      # Trigger 
adjbuf relocation
SUCCESS: 016: printf foo | "$awk" 1 RS=                 # Empty RS (multiline 
records)
SUCCESS: 017: printf '\n\n\nr1f1\nr1f2\n\nr2f1\nr2f2\n\n\n' | "$awk" '{$1=$1}1' 
RS= OFS=:


In 004, I don't know what's going on there with gawk. It emits the same number 
of empty lines as there are possible non-empty, shortest matches to the RS 
regular expression. Somehow, characters which cannot match RS and should appear 
in an output record are consumed. This gawk behavior appears to be useless and 
a bug. Neither diff behaves this way.

Case 012 is more interesting. Here we see that gawk and mawk behave differently 
when RS is anchored. RS=^a in gawk matches only once and only at the very 
beginning of a stream. In mawk, RS=^a can match repeated instances of "a" at 
the beginning of the stream, yielding an empty record for each match.

It seems gawk will only anchor at the start of a stream, while in mawk each 
record search starting point is anchorable. This difference is only evident 
when there are multiple consecutive instances of RS at the start of a stream. 
Otherwise, they behave identically.

The gawk behavior seems more sensible and less surprising. gawk is also the 
more widely-deployed implementation.

The mawk behavior has the advantage of being simpler to implement -- since the 
automaton's initial state does not need to be manipulated to enable/disable the 
hat (^) anchor.

It's likely that the decision of which implementation to emulate will prove to 
be of little importance, since the utility of a hat-anchored regular expression 
record separator is limited at best. Personally, I have a slight preference for 
gawk's behavior.

Before I conclude, I'd like to take a moment to thank Russ Cox. His exposition 
of how finite automatons are used to implement regular expressions in "Regular 
Expression Matching Can Be Simple And Fast" at 
http://swtch.com/~rsc/regexp/regexp1.html helped me, someone with no formal 
schooling in computer science, get through nawk/dist/b.c a lot faster than I 
otherwise would have. Thank you, Russ.

And, finally, my diffs and test script. Hopefully they'll be of some use. First 
the simpler mawk-like diff, then the gawk-like diff, and then the sh script and 
testcases.


Kind regards to all,
Miguel Piñeiro Jr.
>How-To-Repeat:
See description above.
>Fix:
MAWK-LIKE DIFF


Index: b.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/b.c,v
retrieving revision 1.2
diff -u -p -r1.2 b.c
--- b.c 26 Aug 2010 14:55:19 -0000      1.2
+++ b.c 4 Mar 2012 22:11:38 -0000
@@ -624,6 +624,96 @@ int nematch(fa *f, const char *p0) /* no
        return (0);
 }
 
+
+/*
+ * NAME
+ *     fnematch
+ *
+ * DESCRIPTION
+ *     A stream-fed version of nematch which transfers characters to a
+ *     null-terminated buffer. All characters up to and including the last
+ *     character of the matching text or EOF are placed in the buffer. If
+ *     a match is found, patbeg and patlen are set appropriately.
+ *
+ * RETURN VALUES
+ *     0    No match found.
+ *     1    Match found.
+ */  
+
+int fnematch(fa *pfa, FILE *f, uschar **pbuf, int *pbufsize, int quantum)      
+{
+       uschar *buf = *pbuf;
+       int bufsize = *pbufsize;
+       int c, i, j, k, ns, s;
+
+       s = pfa->initstat;
+       assert(s < pfa->state_count);
+       patlen = 0;
+
+       /*
+        * All indices relative to buf.
+        * i <= j <= k <= bufsize
+        *
+        * i: origin of active substring
+        * j: current character
+        * k: destination of next getc()
+        */
+       i = -1, k = 0;
+        do {
+               j = i++;
+               do {
+                       if (++j == k) {
+                               if (k == bufsize)
+                                       if (!adjbuf(&buf, &bufsize, bufsize+1, 
quantum, 0, "fnematch"))
+                                               FATAL("stream '%.30s...' too 
long", buf);       
+                               buf[k++] = (c = getc(f)) != EOF ? c : 0;
+                       }
+                       c = buf[j];
+                       /* assert(c < NCHARS); */
+
+                       if ((ns = pfa->gototab[s][c]) != 0)
+                               s = ns;
+                       else
+                               s = cgoto(pfa, s, c);
+                       assert(s < pfa->state_count);
+
+                       if (pfa->out[s]) {      /* final state */
+                               patlen = j - i + 1;
+                               if (c == 0)     /* don't count $ */
+                                       patlen--;
+                       }
+               } while (buf[j] && s != 1);
+               s = 2;
+       } while (buf[i] && !patlen);
+
+       /* adjbuf() may have relocated a resized buffer. Inform the world. */
+       *pbuf = buf;
+       *pbufsize = bufsize;
+
+       if (patlen) {
+               patbeg = buf + i;
+               /*
+                * Under no circumstances is the last character fed to
+                * the automaton part of the match. It is EOF's nullbyte,
+                * or it sent the automaton into a state with no further
+                * transitions available (s==1), or both. Room for a
+                * terminating nullbyte is guaranteed.
+                *
+                * ungetc any chars after the end of matching text
+                * (except for EOF's nullbyte, if present) and null
+                * terminate the buffer.
+                */
+               do
+                       if (buf[--k] && ungetc(buf[k], f) == EOF)
+                               FATAL("unable to ungetc '%c'", buf[k]); 
+               while (k > i + patlen);
+               buf[k] = 0;
+               return 1;
+       }
+       else
+               return 0;
+}
+
 Node *reparse(const char *p)   /* parses regular expression pointed to by p */
 {                      /* uses relex() to scan regular expression */
        Node *np;
Index: lib.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/lib.c,v
retrieving revision 1.4
diff -u -p -r1.4 lib.c
--- lib.c       20 Jan 2011 21:23:11 -0000      1.4
+++ lib.c       4 Mar 2012 22:11:38 -0000
@@ -188,7 +188,7 @@ void nextfile(void)
 
 int readrec(uschar **pbuf, int *pbufsize, FILE *inf)   /* read one record into 
buf */
 {
-       int sep, c;
+       int sep, c, isrec;
        uschar *rr, *buf = *pbuf;
        int bufsize = *pbufsize;
        size_t len;
@@ -202,48 +202,18 @@ int readrec(uschar **pbuf, int *pbufsize
                        FATAL("field separator %.10s... is too long", *FS);
                memcpy(inputFS, *FS, len_inputFS);
        }
-       if ((sep = **RS) == 0) {
-               sep = '\n';
-               while ((c=getc(inf)) == '\n' && c != EOF)       /* skip leading 
\n's */
-                       ;
-               if (c != EOF)
-                       ungetc(c, inf);
-       } else if ((*RS)[1]) {
+       if (**RS && (*RS)[1]) {
                fa *pfa = makedfa(*RS, 1);
-               int tempstat = pfa->initstat;
-               char *brr = buf;
-               char *rrr = NULL;
-               int x;
-               for (rr = buf; ; ) {
-                       while ((c = getc(inf)) != EOF) {
-                               if (rr-buf+3 > bufsize)
-                                       if (!adjbuf(&buf, &bufsize, 3+rr-buf,
-                                           recsize, &rr, "readrec 2"))
-                                               FATAL("input record `%.30s...'"
-                                                   " too long", buf);
-                               *rr++ = c;
-                               *rr = '\0';
-                               if (!(x = nematch(pfa, brr))) {
-                                       pfa->initstat = tempstat;
-                                       if (rrr) {
-                                               rr = rrr;
-                                               ungetc(c, inf);
-                                               break;
-                                       }
-                               } else {
-                                       pfa->initstat = 2;
-                                       brr = rrr = rr = patbeg;
-                               }
-                       }
-                       if (rrr || c == EOF)
-                               break;
-                       if ((c = getc(inf)) == '\n' || c == EOF)
-                               /* 2 in a row */
-                               break;
-                       *rr++ = '\n';
-                       *rr++ = c;
-               }
+               if (fnematch(pfa, inf, &buf, &bufsize, recsize))
+                       *patbeg = 0;
        } else {
+               if ((sep = **RS) == 0) {
+                       sep = '\n';
+                       while ((c=getc(inf)) == '\n' && c != EOF)       /* skip 
leading \n's */
+                               ;
+                       if (c != EOF)
+                               ungetc(c, inf);
+               }
                for (rr = buf; ; ) {
                        for (; (c=getc(inf)) != sep && c != EOF; ) {
                                if (rr-buf+1 > bufsize)
@@ -264,14 +234,15 @@ int readrec(uschar **pbuf, int *pbufsize
                        *rr++ = '\n';
                        *rr++ = c;
                }
+               if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 
3"))
+                       FATAL("input record `%.30s...' too long", buf);
+               *rr = 0;
        }
-       if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 3"))
-               FATAL("input record `%.30s...' too long", buf);
-       *rr = 0;
-          dprintf( ("readrec saw <%s>, returns %d\n", buf, c == EOF && rr == 
buf ? 0 : 1) );
        *pbuf = buf;
        *pbufsize = bufsize;
-       return c == EOF && rr == buf ? 0 : 1;
+       isrec = *buf || !feof(inf);
+          dprintf( ("readrec saw <%s>, returns %d\n", buf, isrec) );
+       return isrec;
 }
 
 char *getargv(int n)   /* get ARGV[n] */
Index: proto.h
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/proto.h,v
retrieving revision 1.5
diff -u -p -r1.5 proto.h
--- proto.h     16 Sep 2011 16:09:46 -0000      1.5
+++ proto.h     4 Mar 2012 22:11:38 -0000
@@ -54,6 +54,7 @@ extern        int     member(int, const char *);
 extern int     match(fa *, const char *);
 extern int     pmatch(fa *, const char *);
 extern int     nematch(fa *, const char *);
+extern int     fnematch(fa *, FILE *, uschar **, int *, int);
 extern Node    *reparse(const char *);
 extern Node    *regexp(void);
 extern Node    *primary(void);




GAWK-LIKE DIFF


Index: b.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/b.c,v
retrieving revision 1.2
diff -u -p -r1.2 b.c
--- b.c 26 Aug 2010 14:55:19 -0000      1.2
+++ b.c 6 Mar 2012 17:36:15 -0000
@@ -624,6 +624,96 @@ int nematch(fa *f, const char *p0) /* no
        return (0);
 }
 
+
+/*
+ * NAME
+ *     fnematch
+ *
+ * DESCRIPTION
+ *     A stream-fed version of nematch which transfers characters to a
+ *     null-terminated buffer. All characters up to and including the last
+ *     character of the matching text or EOF are placed in the buffer. If
+ *     a match is found, patbeg and patlen are set appropriately.
+ *
+ * RETURN VALUES
+ *     0    No match found.
+ *     1    Match found.
+ */  
+
+int fnematch(fa *pfa, FILE *f, uschar **pbuf, int *pbufsize, int quantum)      
+{
+       uschar *buf = *pbuf;
+       int bufsize = *pbufsize;
+       int c, i, j, k, ns, s;
+
+       s = pfa->initstat;
+       assert(s < pfa->state_count);
+       patlen = 0;
+
+       /*
+        * All indices relative to buf.
+        * i <= j <= k <= bufsize
+        *
+        * i: origin of active substring
+        * j: current character
+        * k: destination of next getc()
+        */
+       i = -1, k = 0;
+        do {
+               j = i++;
+               do {
+                       if (++j == k) {
+                               if (k == bufsize)
+                                       if (!adjbuf(&buf, &bufsize, bufsize+1, 
quantum, 0, "fnematch"))
+                                               FATAL("stream '%.30s...' too 
long", buf);       
+                               buf[k++] = (c = getc(f)) != EOF ? c : 0;
+                       }
+                       c = buf[j];
+                       /* assert(c < NCHARS); */
+
+                       if ((ns = pfa->gototab[s][c]) != 0)
+                               s = ns;
+                       else
+                               s = cgoto(pfa, s, c);
+                       assert(s < pfa->state_count);
+
+                       if (pfa->out[s]) {      /* final state */
+                               patlen = j - i + 1;
+                               if (c == 0)     /* don't count $ */
+                                       patlen--;
+                       }
+               } while (buf[j] && s != 1);
+               s = 2;
+       } while (buf[i] && !patlen);
+
+       /* adjbuf() may have relocated a resized buffer. Inform the world. */
+       *pbuf = buf;
+       *pbufsize = bufsize;
+
+       if (patlen) {
+               patbeg = buf + i;
+               /*
+                * Under no circumstances is the last character fed to
+                * the automaton part of the match. It is EOF's nullbyte,
+                * or it sent the automaton into a state with no further
+                * transitions available (s==1), or both. Room for a
+                * terminating nullbyte is guaranteed.
+                *
+                * ungetc any chars after the end of matching text
+                * (except for EOF's nullbyte, if present) and null
+                * terminate the buffer.
+                */
+               do
+                       if (buf[--k] && ungetc(buf[k], f) == EOF)
+                               FATAL("unable to ungetc '%c'", buf[k]); 
+               while (k > i + patlen);
+               buf[k] = 0;
+               return 1;
+       }
+       else
+               return 0;
+}
+
 Node *reparse(const char *p)   /* parses regular expression pointed to by p */
 {                      /* uses relex() to scan regular expression */
        Node *np;
Index: lib.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/lib.c,v
retrieving revision 1.4
diff -u -p -r1.4 lib.c
--- lib.c       20 Jan 2011 21:23:11 -0000      1.4
+++ lib.c       6 Mar 2012 17:36:15 -0000
@@ -38,6 +38,7 @@ THIS SOFTWARE.
 
 char   EMPTY[] = { '\0' };
 FILE   *infile = NULL;
+int    innew;          /* 1 = infile has not been read by readrec */
 char   *file   = EMPTY;
 uschar *record;
 int    recsize = RECSIZE;
@@ -104,6 +105,7 @@ void initgetrec(void)
                argno++;
        }
        infile = stdin;         /* no filenames, so use stdin */
+       innew = 1;
 }
 
 static int firsttime = 1;
@@ -146,9 +148,12 @@ int getrec(uschar **pbuf, int *pbufsize,
                                infile = stdin;
                        else if ((infile = fopen(file, "r")) == NULL)
                                FATAL("can't open file %s", file);
+                       innew = 1;
                        setfval(fnrloc, 0.0);
                }
-               c = readrec(&buf, &bufsize, infile);
+               c = readrec(&buf, &bufsize, infile, innew);
+               if (innew)
+                       innew = 0;
                if (c != 0 || buf[0] != '\0') { /* normal record */
                        if (isrecord) {
                                if (freeable(fldtab[0]))
@@ -186,9 +191,9 @@ void nextfile(void)
        argno++;
 }
 
-int readrec(uschar **pbuf, int *pbufsize, FILE *inf)   /* read one record into 
buf */
+int readrec(uschar **pbuf, int *pbufsize, FILE *inf, int newflag)      /* read 
one record into buf */
 {
-       int sep, c;
+       int sep, c, isrec, found, tempstat;
        uschar *rr, *buf = *pbuf;
        int bufsize = *pbufsize;
        size_t len;
@@ -202,48 +207,26 @@ int readrec(uschar **pbuf, int *pbufsize
                        FATAL("field separator %.10s... is too long", *FS);
                memcpy(inputFS, *FS, len_inputFS);
        }
-       if ((sep = **RS) == 0) {
-               sep = '\n';
-               while ((c=getc(inf)) == '\n' && c != EOF)       /* skip leading 
\n's */
-                       ;
-               if (c != EOF)
-                       ungetc(c, inf);
-       } else if ((*RS)[1]) {
+       if (**RS && (*RS)[1]) {
                fa *pfa = makedfa(*RS, 1);
-               int tempstat = pfa->initstat;
-               char *brr = buf;
-               char *rrr = NULL;
-               int x;
-               for (rr = buf; ; ) {
-                       while ((c = getc(inf)) != EOF) {
-                               if (rr-buf+3 > bufsize)
-                                       if (!adjbuf(&buf, &bufsize, 3+rr-buf,
-                                           recsize, &rr, "readrec 2"))
-                                               FATAL("input record `%.30s...'"
-                                                   " too long", buf);
-                               *rr++ = c;
-                               *rr = '\0';
-                               if (!(x = nematch(pfa, brr))) {
-                                       pfa->initstat = tempstat;
-                                       if (rrr) {
-                                               rr = rrr;
-                                               ungetc(c, inf);
-                                               break;
-                                       }
-                               } else {
-                                       pfa->initstat = 2;
-                                       brr = rrr = rr = patbeg;
-                               }
-                       }
-                       if (rrr || c == EOF)
-                               break;
-                       if ((c = getc(inf)) == '\n' || c == EOF)
-                               /* 2 in a row */
-                               break;
-                       *rr++ = '\n';
-                       *rr++ = c;
+               if (newflag)
+                       found = fnematch(pfa, inf, &buf, &bufsize, recsize);
+               else {
+                       tempstat = pfa->initstat;
+                       pfa->initstat = 2;
+                       found = fnematch(pfa, inf, &buf, &bufsize, recsize);
+                       pfa->initstat = tempstat;
                }
+               if (found)
+                       *patbeg = 0;
        } else {
+               if ((sep = **RS) == 0) {
+                       sep = '\n';
+                       while ((c=getc(inf)) == '\n' && c != EOF)       /* skip 
leading \n's */
+                               ;
+                       if (c != EOF)
+                               ungetc(c, inf);
+               }
                for (rr = buf; ; ) {
                        for (; (c=getc(inf)) != sep && c != EOF; ) {
                                if (rr-buf+1 > bufsize)
@@ -264,14 +247,15 @@ int readrec(uschar **pbuf, int *pbufsize
                        *rr++ = '\n';
                        *rr++ = c;
                }
+               if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 
3"))
+                       FATAL("input record `%.30s...' too long", buf);
+               *rr = 0;
        }
-       if (!adjbuf(&buf, &bufsize, 1+rr-buf, recsize, &rr, "readrec 3"))
-               FATAL("input record `%.30s...' too long", buf);
-       *rr = 0;
-          dprintf( ("readrec saw <%s>, returns %d\n", buf, c == EOF && rr == 
buf ? 0 : 1) );
        *pbuf = buf;
        *pbufsize = bufsize;
-       return c == EOF && rr == buf ? 0 : 1;
+       isrec = *buf || !feof(inf);
+          dprintf( ("readrec saw <%s>, returns %d\n", buf, isrec) );
+       return isrec;
 }
 
 char *getargv(int n)   /* get ARGV[n] */
Index: proto.h
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/proto.h,v
retrieving revision 1.5
diff -u -p -r1.5 proto.h
--- proto.h     16 Sep 2011 16:09:46 -0000      1.5
+++ proto.h     6 Mar 2012 17:36:15 -0000
@@ -54,6 +54,7 @@ extern        int     member(int, const char *);
 extern int     match(fa *, const char *);
 extern int     pmatch(fa *, const char *);
 extern int     nematch(fa *, const char *);
+extern int     fnematch(fa *, FILE *, uschar **, int *, int);
 extern Node    *reparse(const char *);
 extern Node    *regexp(void);
 extern Node    *primary(void);
@@ -122,7 +123,7 @@ extern      void    makefields(int, int);
 extern void    growfldtab(int n);
 extern int     getrec(uschar **, int *, int);
 extern void    nextfile(void);
-extern int     readrec(uschar **buf, int *bufsize, FILE *inf);
+extern int     readrec(uschar **buf, int *bufsize, FILE *inf, int newflag);
 extern char    *getargv(int);
 extern void    setclvar(char *);
 extern void    fldbld(void);
@@ -191,7 +192,7 @@ extern      Cell    *bltin(Node **, int);
 extern Cell    *printstat(Node **, int);
 extern Cell    *nullproc(Node **, int);
 extern FILE    *redirect(int, Node *);
-extern FILE    *openfile(int, const char *);
+extern FILE    *openfile(int, const char *, int *);
 extern const char      *filename(FILE *);
 extern Cell    *closefile(Node **, int);
 extern void    closeall(void);
Index: run.c
===================================================================
RCS file: /cvsroot/src/external/historical/nawk/dist/run.c,v
retrieving revision 1.4
diff -u -p -r1.4 run.c
--- run.c       22 Nov 2011 22:30:22 -0000      1.4
+++ run.c       6 Mar 2012 17:36:15 -0000
@@ -406,7 +406,7 @@ Cell *awkgetline(Node **a, int n)   /* get
        FILE *fp;
        uschar *buf;
        int bufsize = recsize;
-       int mode;
+       int mode, newflag;
 
        if ((buf = malloc(bufsize)) == NULL)
                FATAL("out of memory in getline");
@@ -418,12 +418,12 @@ Cell *awkgetline(Node **a, int n) /* get
                mode = ptoi(a[1]);
                if (mode == '|')                /* input pipe */
                        mode = LE;      /* arbitrary flag */
-               fp = openfile(mode, getsval(x));
+               fp = openfile(mode, getsval(x), &newflag);
                tempfree(x);
                if (fp == NULL)
                        n = -1;
                else
-                       n = readrec(&buf, &bufsize, fp);
+                       n = readrec(&buf, &bufsize, fp, newflag);
                if (n <= 0) {
                        ;
                } else if (a[0] != NULL) {      /* getline var <file */
@@ -1623,7 +1623,7 @@ Cell *bltin(Node **a, int n)      /* builtin 
                if (isrec(x) || strlen(getsval(x)) == 0) {
                        flush_all();    /* fflush() or fflush("") -> all */
                        u = 0;
-               } else if ((fp = openfile(FFLUSH, getsval(x))) == NULL)
+               } else if ((fp = openfile(FFLUSH, getsval(x), NULL)) == NULL)
                        u = -1;
                else
                        u = fflush(fp);
@@ -1715,7 +1715,7 @@ FILE *redirect(int a, Node *b)    /* set up
 
        x = execute(b);
        fname = getsval(x);
-       fp = openfile(a, fname);
+       fp = openfile(a, fname, NULL);
        if (fp == NULL)
                FATAL("can't open file %s", fname);
        tempfree(x);
@@ -1746,7 +1746,7 @@ void stdinit(void)        /* in case stdin, etc
        files[2].mode = GT;
 }
 
-FILE *openfile(int a, const char *us)
+FILE *openfile(int a, const char *us, int *pnewflag)
 {
        const char *s = us;
        size_t i;
@@ -1756,11 +1756,12 @@ FILE *openfile(int a, const char *us)
        if (*s == '\0')
                FATAL("null file name in print or getline");
        for (i = 0; i < nfiles; i++)
-               if (files[i].fname && strcmp(s, files[i].fname) == 0) {
-                       if (a == files[i].mode || (a==APPEND && 
files[i].mode==GT))
-                               return files[i].fp;
-                       if (a == FFLUSH)
-                               return files[i].fp;
+               if (files[i].fname && strcmp(s, files[i].fname) == 0 &&
+                   (a == files[i].mode || (a==APPEND && files[i].mode==GT) ||
+                    a == FFLUSH)) {
+                       if (pnewflag)
+                               *pnewflag = 0;
+                       return files[i].fp;
                }
        if (a == FFLUSH)        /* didn't find it, so don't create it! */
                return NULL;
@@ -1797,6 +1798,8 @@ FILE *openfile(int a, const char *us)
                files[i].fname = tostring(s);
                files[i].fp = fp;
                files[i].mode = m;
+               if (pnewflag)
+                       *pnewflag = 1;
        }
        return fp;
 }




====================================


The testscript takes any number of awk implementations as arguments and reads 
test cases (pipelines to eval) from standard input. For example:
./awktest.sh gawk mawk nawk < testcases > results 2>&1

$ cat awktest.sh
#/bin/sh

if [ $# -eq 0 ]; then
        printf '%s\n' 'Where'\''s the AWK?' 1>&2
        exit 2
fi

# Pair each AWK implementation with a tempfile
for awk; do
        f=$(mktemp -t awktests)
        if [ $? -ne 0 ]; then 
                rm $flist 2> /dev/null
                exit 2
        fi
        flist="$flist $f"
        set "$@" "$awk:$f"
        shift
done

e=0     # Exit status.
i=0     # Test case loop index
# CAUTION: eval ahead.
while read -r testcase; do
        oldf=
        result=SUCCESS
        for awk; do
                f=${awk##*:}
                awk=${awk%:*}
                eval "$testcase" > $f 2>&1 || result=FAILURE
                if [ "$oldf" ]; then
                        cmp -s $oldf $f || result=FAILURE
                fi
                oldf=$f
        done
        printf '%s: %s: %s\n' $result $(printf %03d $i) "$testcase"
        if [ $result = FAILURE ]; then
                e=1
                for awk; do
                        f=${awk##*:}
                        awk=${awk%:*}
                        printf '%s:\n' "$awk"
                        sed 's/^/--> /' $f
                done
        fi
        i=$((i+1))
done

rm $flist 2> /dev/null
exit $e


$ cat testcases
printf 1a2aa3aaa4 | "$awk" 1 RS=a               # Single character RS
printf 1a2aa3aaa4 | "$awk" 1 RS='[a]'           # Regular expression RS
printf 1a2ab3aba4 | "$awk" 1 RS='[ab]'
printf 1ab2ab3ab4 | "$awk" 1 RS=ab
printf 1a2aa3aaa4 | "$awk" 1 RS='a*'
printf 1a2aa3aaa4 | "$awk" 1 RS='aa*'
printf 1a2ab3aab4 | "$awk" 1 RS='aa*b*'
printf 1abc | "$awk" 1 RS='abcde|b'
printf 1abcdf2 | "$awk" 1 RS='abcde|b'
printf 1abcdebf2 | "$awk" 1 RS='abcde|b'
printf 1abcdf2 | "$awk" 1 RS='abcde|a'
printf a1a2a3a | "$awk" 1 RS='^a'               # Anchors
printf aaa1a2a | "$awk" 1 RS='^a'
printf a1a2a3a | "$awk" 1 RS='a$'
printf a1a2aaa | "$awk" 1 RS='a$'
jot -s a 10000 | "$awk" 'NR>1' RS='999[0-9]'    # Trigger adjbuf relocation
printf foo | "$awk" 1 RS=                       # Empty RS (multiline records)
printf '\n\n\nr1f1\nr1f2\n\nr2f1\nr2f2\n\n\n' | "$awk" '{$1=$1}1' RS= OFS=:



Home | Main Index | Thread Index | Old Index