tech-userlevel archive


Re: bin/39002: harmful AWK extension: non-portable escaped character




On 12-Jul-08, at 5:14 PM, David Holland wrote:

On Thu, Jun 26, 2008 at 10:52:57AM -0400, Greg A. Woods; Planix, Inc. wrote:
It's not clearly defined there at all.

...which is why it ought to generate a warning.

No, I don't think so.

As others have shown the standard (POSIX) does not define the behaviour of
backslashes in a string constant in AWK.

However, as I've shown, the history of AWK in the context of UNIX not only clearly defines the purpose and meaning of backslashes in a string constant (and separately in regular expressions); it also makes plainly evident the rationale for why these things have always worked the way they do in all but one(?) "rogue"(*) implementation.

I'm not clear on what rationale you're thinking of. If someone writes
the string constant "^.*\.txt$", it's evident upon inspection by a
human that they intended the \. to escape the regexp metacharacter,
that is, they meant to write "^.*\\.txt$".

Let us begin again.

If someone stores a regular expression in an AWK string (and then uses that string in a place where it is allowed as a replacement for a regular expression in AWK) then that person needs to know that AWK has different syntax for strings and for REs and they must take the differences in syntax into account.
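
A minimal sketch of that difference, assuming a POSIX shell and any POSIX awk on the PATH (the function name match_demo and the sample inputs are purely illustrative). The same pattern is written once as an RE constant and once as a string constant; the string form needs the backslash doubled because the string lexer consumes one level of escaping before the RE code ever sees it:

```shell
#!/bin/sh
# Illustrative only: compare an RE constant with the same RE held in a string.
match_demo() {
    printf '%s\n' 'a.txt' 'axtxt' | awk '{
        as_re  = ($0 ~ /^.*\.txt$/)     # RE constant: a single backslash
        pat    = "^.*\\.txt$"           # string constant: backslash doubled
        as_str = ($0 ~ pat)             # the string used as a dynamic RE
        print $0, as_re, as_str
    }'
}
match_demo
```

Both forms should agree, printing "a.txt 1 1" and "axtxt 0 0". Writing "^.*\.txt$" with a single backslash in the string is the case POSIX leaves undefined, so its result depends on the implementation.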

This is not an uncommon issue in several other scripting languages either, especially of course those that make good use of regular expressions.

In fact one could say that it is a design feature of AWK and those other (usually scripting and embedded) languages.

A consistent syntax for AWK strings and AWK regular expressions is simply not possible, since the definitions of what each escaped character means differ (except perhaps in the obvious cases where the meaning is identical, such as '\\', '\b', '\f', '\n', '\r', and '\ddd' where 'ddd' is one to three digits between 0 and 7). Regular expression constants have additional special metacharacters that string constants do not.

Therefore, as I said, the rationale should be obvious. Within the context and heritage of AWK's primary environment it is natural and expected that AWK strings have the same syntax as C strings and that AWK REs have the same syntax as in grep/sed/ed et al. This is because expressing REs using the string constant syntax is tedious and contrary to the syntax used in those other RE-rich tools. Since it is on occasion useful to manipulate REs within a program, and to use the results in RE expressions and for RE operands, it is necessary to transform the RE syntax into the syntax of a string. Programmers making such use of strings as REs should be very aware of this necessity.
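
That transformation is mechanical, and can be sketched as follows. This assumes a POSIX shell and awk; the helper name re_to_awk_string is hypothetical, and it handles only backslashes and double quotes, nothing more:

```shell
#!/bin/sh
# Hypothetical helper: rewrite RE text as an AWK string constant by
# prefixing every backslash and double quote with a backslash.
re_to_awk_string() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c == "\\" || c == "\"")   # characters special in a string
                out = out "\\"
            out = out c
        }
        print "\"" out "\""
    }'
}
re_to_awk_string '^.*\.txt$'
```

For the RE ^.*\.txt$ this prints "^.*\\.txt$", quotes included, which is the form to paste into a program as a string constant.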

Unfortunately the AWK manual does not define the meaning of "the usual C escapes" for strings. The AWK book does, though, and the manual clearly says that the AWK we use in NetBSD is an implementation of the language described in the AWK book. The book clearly defines that '\c', when 'c' is any other character (i.e. not one already given meaning in the table on page 191), means simply that character literally. In other words this implementation clearly defines the meaning of '\.' in a C-style string even though POSIX allows that it is "undefined". The book is a little less concise in its definition of escape sequences in regular expressions; however, in the end it is still abundantly clear that '\c', when 'c' is any "other" character, is also just that literal character 'c' in any regular expression.

So for this implementation which defines the undefined behaviours w.r.t. escape sequences in the standard, there are no simple cases where anything dumber than a human can, at the immediate parser step, give a valid warning about possible misuse of an escape sequence in a string constant.

Finally, note that should an AWK implementation choose to go beyond POSIX Extended Regular Expressions for its REs, there will be further clashes with the use of C escape sequences in the current RE syntax. For example, with full Perl-compatible REs, e.g. as implemented by the PCRE package, the '\b' escape sequence changes from matching a backspace character to matching a word boundary.

What I don't understand is why you think it's desirable to assume the
opposite meaning, which is clearly not what anyone intends or wants.

The intention of the programmer cannot be determined by a simple parser. Analysis in depth of how each string constant is eventually used would be necessary to intuit the programmer's intent. Without doing such analysis it is impossible to automatically provide helpful warnings to the programmer.

The only enhancement I can envision that could "fix" this issue for some people would be to allow string variables to be assigned from RE constants, e.g.:

        str = /^.*\.txt$/;

and thus the conversion of RE escape sequences to string escape sequences could be done internally and silently.

This might, though, make it more difficult for naive programmers to then perform transformations on strings assigned from REs in this way, since they might not expect the representation of the RE to have changed internally.

And of course as an enhancement to the current language it would not be portable until all implementations implemented it.

Perhaps if the warning were made significantly more intelligent then its warnings might prove to be useful, but only in that case. I would strongly
suggest that warnings MUST NOT be given for properly escaped regular
expressions which are expressed as string constants.

This paragraph does not make any sense.

I don't understand why it doesn't make sense to you. If warnings are to be generated where the AWK parser sees a string constant that it thinks might eventually be used as an RE, then it must be smarter than the programmer and not produce noise for otherwise valid string syntax where the string is never used as an RE. Even something that might look like an RE in the context of a string constant may never be used as an RE during execution of the program.
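
A small sketch of why the parser cannot tell, again assuming a POSIX shell and awk (usage_demo and the variable names are illustrative only). The same correctly escaped string constants are used as plain text in one statement and as an RE in another, and nothing at the point of the constant says which it will be:

```shell
#!/bin/sh
# Illustrative only: identical string constants, used as text and as an RE.
usage_demo() {
    awk 'BEGIN {
        s = "^.*\\.txt$"           # looks like an RE, but is only a string here
        print length(s)            # used purely as text: 9 characters
        print ("notes.txt" ~ s)    # becomes an RE only at this point, at run time
        sep = "\\."                # could be an RE, or literal text for output
        print "a" sep "b"          # here it is literal text
    }'
}
usage_demo
```

This should print 9, then 1, then a\.b. A warning keyed on the appearance of the constants alone would be noise for the first and last uses.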

--
                                        Greg A. Woods; Planix, Inc.
                                        <woods%planix.ca@localhost>


