Re: bin/39002: harmful AWK extension: non-portable escaped character
On 12-Jul-08, at 5:14 PM, David Holland wrote:
On Thu, Jun 26, 2008 at 10:52:57AM -0400, Greg A. Woods; Planix,
It's not clearly defined there at all.
...which is why it ought to generate a warning.
No, I don't think so.
As others have shown the standard (POSIX) does not fully define the
meaning of backslashes in a string constant in AWK.
However as I've shown the history of AWK in the context of UNIX not
only clearly defines the purpose and meaning of backslashes in a string
(and separately in regular expressions), but a rationale is also
evident for the way these things have always worked the way they do
in all but one(?) "rogue"(*) implementation.
I'm not clear on what rationale you're thinking of. If someone writes
the string constant "^.*\.txt$", it's evident upon inspection by a
human that they intended the \. to escape the regexp metacharacter,
that is, they meant to write "^.*\\.txt$".
Let us begin again.
If someone stores a regular expression in an AWK string (and then uses
that string in a place where it is allowed as a replacement for a
regular expression in AWK) then that person needs to know that AWK has
different syntax for strings and for REs and they must take the
differences in syntax into account.
This is not an uncommon issue in several other scripting languages
either, especially of course those that make good use of regular
expressions.
In fact one could say that it is a design feature of AWK and those
other (usually scripting and embedded) languages.
A consistent syntax for AWK strings and AWK regular expressions is
simply not possible since the definitions of what each escaped
character means differs (except perhaps in the obvious cases where the
meaning is identical, such as '\\', '\b', '\f', '\n', '\r', and '\ddd'
where 'ddd' is one to three digits between 0 and 7). Regular
expression constants have additional special metacharacters that
string constants do not.
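To make the difference concrete, here is a minimal sketch (the helper name check_name and the sample file names are only illustrative, not anything from the PR). It tests the same input once against an RE constant and once against the same RE held in a string, where the backslash has to be doubled to survive string-constant processing. The doubled form is used because it behaves the same way in every implementation.

```sh
# Hypothetical helper: prints two flags for $1, first using an RE
# constant, then using the same RE stored in a string.
check_name() {
    awk -v s="$1" 'BEGIN {
        re = "^.*\\.txt$"        # string syntax: \\ yields one backslash
        a = (s ~ /^.*\.txt$/)    # RE syntax: \. is a literal dot
        b = (s ~ re)             # dynamic RE: the string value is ^.*\.txt$
        print a, b
    }'
}

check_name notes.txt    # prints "1 1"
check_name notes_txt    # prints "0 0"
```

Both forms agree because the string's value, after string escapes are processed, is exactly the text of the RE constant.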
Therefore as I said the rationale should be obvious. Within the
context and heritage of AWK's primary environment it is natural and
expected that AWK strings have the same syntax as C strings and that
AWK REs have the same syntax as in grep/sed/ed et al. This is because
expressing REs using the string constant syntax is tedious and
contrary to the syntax used in those other RE-rich tools such as ed,
sed, grep, etc. Since it is on occasion useful to manipulate REs
within a program and to use the results in RE expressions and for RE
operands it is necessary to transform the RE syntax into the syntax of
a string. Programmers making such use of strings as REs should be
very aware of this necessity.
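As a rough illustration of that transformation, the following sketch turns the text of an RE into the body of an AWK string constant by doubling each backslash and escaping double quotes. The function name re_to_awk_string is made up for this example, and it assumes the RE contains no newlines or other characters that need further string escapes.

```sh
# Hypothetical helper: rewrite RE text so it can be pasted between
# double quotes as an AWK string constant.
re_to_awk_string() {
    # First double every backslash, then escape any double quote.
    printf '%s\n' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

re_to_awk_string '^.*\.txt$'    # prints ^.*\\.txt$
```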
Unfortunately the AWK manual does not define the meaning of "the usual
C escapes" for strings. However the AWK book (and the manual does
clearly say that the AWK we use in NetBSD is an implementation of the
language described in the AWK book) does clearly define that '\c',
when 'c' is any other character (i.e. not one already given meaning in
the table on page 191), means simply that character literally (i.e.
this implementation clearly defines the meaning of '\.' in a C string
even though POSIX allows that it is "undefined"). The book is a little
less concise in the definition of escape sequences in regular
expressions, however in the end it is still abundantly clear that '\c'
when 'c' is any "other" character is also just that literal character
'c' in any regular expression.
So for this implementation, which defines the behaviours the standard
leaves undefined w.r.t. escape sequences, there are no simple cases
where anything dumber than a human can, at the immediate parser step,
give a valid warning about possible misuse of an escape sequence in a
string constant.
Finally note that should an AWK implementation choose to try to go
beyond use of POSIX Extended Regular Expressions for its REs then
there will be further clashes with the use of C escape sequences in
the current RE syntax. I.e. for example with full Perl-compatible
REs, eg. as implemented by the PCRE package, the '\b' escape sequence
changes from matching a backspace character to matching a word boundary.
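For reference, a small sketch of the string side of that clash: in an AWK string constant '\b' is a single backspace character (code 8), which is the meaning a PCRE-style RE syntax would take over for word boundaries. The function name backspace_demo is hypothetical.

```sh
# Hypothetical demo: show that "\b" in an AWK string is one character
# and that it equals the character with code 8 (backspace).
backspace_demo() {
    awk 'BEGIN {
        s = "a\bc"
        bs = sprintf("%c", 8)            # backspace built from its code
        print length(s), (s == "a" bs "c")
    }'
}

backspace_demo    # prints "3 1"
```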
What I don't understand is why you think it's desirable to assume the
opposite meaning, which is clearly not what anyone intends or wants.
The intention of the programmer cannot be determined by a simple
parser. Analysis in depth of how each string constant is eventually
used would be necessary to intuit the programmer's intent. Without
doing such analysis it is impossible to automatically provide helpful
warnings to the programmer.
The only enhancement I can envision that could "fix" this issue for
some people would be to allow string variables to be assigned from RE
constants, e.g.:
str = /^.*\.txt$/;
and thus the conversion of RE escape sequences to string escape
sequences could be done internally and silently.
This might make it more difficult for naive programmers to then
perform transformations on strings assigned from REs in this way,
though, since they might not expect the representation of the RE to
have changed internally.
And of course as an enhancement to the current language it would not
be portable until all implementations implemented it.
Perhaps if the warning were made significantly more intelligent,
warnings might prove to be useful, but only in that case. I would
suggest that warnings MUST NOT be given for properly escaped regular
expressions which are expressed as string constants.
This paragraph does not make any sense.
I don't understand why it doesn't make sense to you. If warnings are
to be generated where the AWK parser sees a string constant that it
thinks might be used eventually as an RE then it must be smarter than
the programmer and not produce noise for otherwise valid string syntax
where the string is not ever used as an RE. Even something that might
look like an RE in the context of a string constant may not ever be
used as an RE during execution of the program.
Greg A. Woods; Planix, Inc.