tech-userlevel archive


Re: shell (/bin/sh) pattern matching bugs



    Date:        Sun, 24 Jun 2018 13:19:00 +0200
    From:        tlaronde%polynum.com@localhost
    Message-ID:  <20180624111859.GA681%polynum.com@localhost>

First, thanks for reading the message, and looking at the tests
and sending the comments/question - this is exactly the kind of
response I was hoping for.

Aside from fixing the NetBSD sh (correctly) this can also assist
in getting the POSIX spec done properly.  But in that regard, do
note that "properly" in this context is not necessarily "rational",
it has to take into account bugs (or design issues) in ancient
Bourne shells that have faithfully been copied into more modern
shells... (because no-one can know what scripts might be depending
upon the behaviour that has been implemented).

  | > [97] var="[:alpha:]"; case "[" in (["$var"]) printf M;; (*) printf X;; esac
  | > [97] Expected output 'M', received 'X'
  | > 
  |
  | Can you explain why you expect success ("M") in this case?

I can try...

  |  I expect:
  |
  | - Substitution of the value of $var in (["$var"]) resulting in:
  | 	(["[:alpha:]"]);

Yes.

  | - [Suppression of the double quotes?

This is, of course, the heart of the matter...

In POSIX, quote removal is explicitly not done on case
patterns: the expansions that are done are listed,
and quote removal is not one of them.

So...

  | But this doesn't change anything in
  | the bracket expression];

It would: read literally, the current text means that an input
string consisting of a double quote (as in '"' or \") would match,
because the double quote character would appear in the [ ]
expression in the pattern.
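
A quick way to see whether a given shell takes that literal
reading is to run the pattern from test 97 against a subject that
is a double quote.  This is only a sketch; the function name
quote_is_member is made up for illustration.  A shell that kept
the quotes as members of the bracket expression would print M;
X is the result expected from a shell that treats them as
quoting.

```shell
# Pattern from test 97, tried against a double-quote subject.
# quote_is_member is a hypothetical name, not from the thread.
quote_is_member() {
    var="[:alpha:]"
    case '"' in
        (["$var"]) printf 'M\n' ;;   # quotes were taken as set members
        (*)        printf 'X\n' ;;   # quotes were taken as quoting
    esac
}

quote_is_member
```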

Of course that is clearly absurd, and a bug report on the POSIX
text was submitted a while ago to include quote removal in the
list of operations to perform on case patterns.

Unfortunately, it isn't that simple, as just doing quote
removal on patterns would cause

	case x in ("*") echo match;; esac

to match as the quote removal would leave the
pattern being just an asterisk, which matches anything,
which is not what is supposed to happen.
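
The intended behaviour can be checked directly.  In this small
sketch (match_star is an illustrative name), the quoted asterisk
matches only a literal asterisk, and everything else falls
through to the unquoted one.

```shell
# A quoted * in a case pattern is literal; an unquoted * matches anything.
match_star() {
    case "$1" in
        ("*") echo literal ;;
        (*)   echo other ;;
    esac
}

match_star x      # other
match_star '*'    # literal
```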

So the current proposed new text (which had been
accepted, but now is being discussed again, and will
be changed) also specified that along with quote removal,
any "pattern magic" characters in the quoted part of the
pattern would be \ escaped so they remained literal,
so "quote removal" of the "*" would produce \* not *
and so the pattern matching would look for a literal
asterisk rather than anything - which is what is wanted.
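
The escaping idea can be approximated by hand.  The sketch below
is not the proposed POSIX wording, just an illustration: the
helper names escape_pattern and matches_literally are invented
here, and it assumes a shell in which a backslash that arrives in
a pattern via an unquoted expansion still acts as an escape.

```shell
# Hypothetical helper: backslash-escape the characters that are
# special in a shell pattern (* ? [ ] and \ itself).
escape_pattern() {
    printf '%s\n' "$1" | sed 's/[][*?\\]/\\&/g'
}

# Use the escaped string as a (deliberately unquoted) case pattern.
# Assumes the shell honours \ as an escape in an expanded pattern.
matches_literally() {
    pat=$(escape_pattern "$2")
    case "$1" in
        ($pat) echo match ;;
        (*)    echo no-match ;;
    esac
}

escape_pattern '*'           # prints \*
matches_literally x '*'      # no-match
matches_literally '*' '*'    # match
```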

But it turns out that this gets really messy, and makes
case pattern processing different from filename expansion
(glob) and variable expansion.  Those cases do specify
quote removal, but it doesn't happen until after the
pattern is used.  That is, in
	ls x"*"y
we want a listing of the file named x-asterisk-y not all
files with names starting x and ending y.
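
The glob case is easy to try in a scratch directory.  This
sketch assumes mktemp -d is available (it is not required by
POSIX), and demo_glob_quoting is just an illustrative name.  It
counts the words each form expands to rather than calling ls.

```shell
# Compare x"*"y and x*y in a directory holding x*y, xay and xbby.
# Runs in a subshell so the cd and the positional parameters
# do not leak into the caller.
demo_glob_quoting() (
    dir=$(mktemp -d) || exit 1       # assumption: mktemp -d exists
    cd "$dir" || exit 1
    : > 'x*y'; : > xay; : > xbby
    set -- x"*"y                     # quoted *: no expansion happens
    echo "quoted: $# ($1)"
    set -- x*y                       # unquoted *: expands to all three
    echo "unquoted: $#"
    cd / && rm -rf "$dir"
)

demo_glob_quoting
```

The first line reports one word, the name x*y itself; the second
reports three.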

That means that one way or another, quoting needs to be
considered when matching patterns, and handled properly
rather than the quotes just being removed.  How a shell chooses
to implement that is up to the shell of course.

Beyond that, when we get inside [ ] expressions, things get
even messier, as POSIX (mostly to save paper, I think...)
simply refers to the regular expression definition of how those
are processed (with the exception of substituting ! in sh
for the ^ in REs as the character to invert the match - because
^ was the "pipe" symbol in early shells - but that's not relevant
here).

But the effect of doing the specification that way is that the \
which escapes magic characters in regular expressions does
not work inside [ ] (and the text is explicit about that - and correct).
That means the technique in the proposed revised POSIX text
of replacing " and ' quoting with \ doesn't work at all as intended
inside [ ], which is the case in test 97.  But that can't be right
either, as then

	case - in [a\-z]) ...

would not match, and whether you believe it should or not,
matching there is what all shells have done forever (that
is, the quoted - is a literal minus/hyphen/dash (whatever you
prefer to call it) and not the range indicator, where in a
regular expression that would be an 'a' and a range with
all chars from '\' to 'z').
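
The shell side of that is simple to confirm.  In this sketch
(in_set is an illustrative name) the quoted - is an ordinary
member of the set, so only 'a', '-' and 'z' match, and a
character such as 'b', which a '\' to 'z' range would include,
does not.

```shell
# [a\-z] in a shell pattern: the members are a, - and z.
in_set() {
    case "$1" in
        ([a\-z]) echo match ;;
        (*)      echo no-match ;;
    esac
}

in_set -    # match
in_set a    # match
in_set z    # match
in_set b    # no-match
```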

  | But then "[" is not an alpha, so it correctly fails...
  |
  | Could you explain why you think otherwise?

The simple explanation of this is that because the '[' in
the '[:alpha:]' is quoted, it is not a '[:' character class
opening sequence, but a literal opening square bracket
followed by a colon, an 'a', an 'l' ... which means that we
have a bracket expression which includes a '[' character
as one of its members, and so the test '[' matches.
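
For comparison, here is test 97 next to the unquoted form, as a
sketch with made-up function names (alpha_class, test97).  The
unquoted [[:alpha:]] is a character class, so 'a' matches and '['
does not.  What test97 prints is exactly the point in dispute:
M under the reading above, X in a shell that still sees a
character class.

```shell
# Unquoted baseline: [:alpha:] is a character class here.
alpha_class() {
    case "$1" in
        ([[:alpha:]]) printf 'M\n' ;;
        (*)           printf 'X\n' ;;
    esac
}

# Test 97 from this thread, wrapped in a function.
test97() {
    var="[:alpha:]"
    case "[" in
        (["$var"]) printf 'M\n' ;;
        (*)        printf 'X\n' ;;
    esac
}

alpha_class a      # M
alpha_class '['    # X
test97             # M or X, depending on the shell
```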

But as the long explanation above indicates, this is by no
means a clear-cut case, and more discussion is a good
thing.  Given that we need to retain some compatibility with
shells of the past (and POSIX definitely wants that, and
except where POSIX is stupid, we want POSIX compat)
this is one of the issues where we need to work out what we
should do.

kre


