tech-userlevel archive


Re: bin/39002: harmful AWK extension: non-portable escaped character



I think it may still be important here to point out that AWK has separate syntax for expressing regular expressions and strings because of this very issue of what backslashes represent in each syntax. This should be abundantly clear to any C programmer who has had occasion to represent REs in C strings (or any other of the many languages offering only C-like strings and no separate RE syntax), or to those who have used something like lex which also has separate syntax for regular expressions and strings.

As the awk(1) manual says:

     String constants are quoted " ", with the usual C escapes recognized
     within.

and:

     / re / is a constant regular expression; any string (constant or
     variable) may be used as a regular expression, except in the position
     of an isolated regular expression in a pattern.

Perhaps if the manual also explicitly warned that expressing an RE as a string requires extra escaping of all backslashes, instead of relying on the reader's experience with C and/or shell (command-line) strings, then this issue would, eventually, go _quietly_ away.
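
As a minimal sketch of that extra escaping (the sample input lines and the shell variable names are illustrative assumptions, not taken from any manual), the same RE needs one backslash when written as a constant regular expression but two when written as a string:

```shell
# Two sample input lines: only the first contains a literal "a.b".
input=$(printf 'a.b\naxb\n')

# Constant regular expression: one backslash makes the dot literal.
re_literal=$(printf '%s\n' "$input" | awk '/a\.b/ { print }')

# The same RE expressed as a string: the backslash must be doubled,
# because string syntax consumes one level of escaping before the
# result is interpreted as a regular expression.
re_string=$(printf '%s\n' "$input" | awk '$0 ~ "a\\.b" { print }')

# For contrast, an unescaped dot in a string RE matches any character,
# so both input lines are selected.
re_anychar=$(printf '%s\n' "$input" | awk '$0 ~ "a.b" { print }')

printf 'literal: %s\n' "$re_literal"
printf 'string:  %s\n' "$re_string"
```

Writing "a\.b" as a string is deliberately left out of this sketch, since how an unrecognized escape in a string is treated is exactly the point on which implementations differ.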

I note the mawk manual does say explicitly (w.r.t. string constant syntax):

     If you escape any other character \c, you get \c, i.e., mawk ignores
     the escape.

and like the AWK manual it also declares that RE syntax is separate from string syntax by saying:

        Regular expressions are enclosed in slashes,

and finally it says:

     Any expression can be used on the right hand side of the ~ or !~
     operators or passed to a built-in that expects a regular expression.
     If needed, it is converted to string, and then interpreted as a
     regular expression.

which in a round-about way also says what I've said above, at least to anyone cognizant of the differences between strings and REs, i.e. that care will have to be taken to properly represent backslashes and such in strings that will be interpreted as regular expressions.


I believe the mistake that triggered all of this was in assuming that "gawk" can be used as an interpreter for a portable AWK language script. It cannot. GAWK in its native mode is not AWK compatible. GAWK has this glaring difference:

     The escape sequences may also be used inside constant regular
     expressions (e.g., /[ \t\f\n\r\v]/ matches whitespace characters).

In true AWK, regular expressions are pure ("a `\' followed by any other character (matching that character taken as an ordinary character, as if the `\' had not been present)") and they are not cross-contaminated by C-like syntax in the way that GAWK's are.

I don't know if GAWK's so-called "compatibility" mode corrects this difference or not.

I think the GAWK people had far too much influence on the POSIX AWK standardization, perhaps sadly because GAWK was one of the few alternative (and open) implementations in contention at the time the standard was written. Perhaps this ambiguity in the POSIX AWK standard was also due to the lack of an earlier firm RE standard which GAWK could have adhered to and which POSIX AWK could have referenced, i.e. one which would have disallowed C-like character escapes in pure REs. GAWK is certainly the odd one out here and now.

On 24-Jun-08, at 10:32 AM, Valeriy E. Ushakov wrote:
> After successfully alienating and antagonizing your audience, don't be
> surprised people are not interested in hearing whatever rational
> argument you might actually have there.


Thanks!  :-)

--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>


