tech-userlevel archive


Re: shell (/bin/sh) pattern matching bugs

    Date:        Mon, 25 Jun 2018 17:30:25 +0200
    Message-ID:  <>

  | About the POSIX description "2.13.1 Patterns Matching a Single
  | Character", have the draft assigned a precedence between "XBD RE Bracket
  | Expression" and the shell behavior, i.e. "XBD RE Bracket"
  | taking precedence over the shell behavior?

There is some language, and more proposed to be added, which is designed
to try and make the 3 users (regular expressions, sh glob, and fnmatch(3))
(which are all different...) be able to share the same specification, and be

As it is currently, it fails.   Fixing it is still a work in progress.   Unless
it becomes possible to convince the people who really count to simply
split the specification, and describe each of them separately, it is
going to be difficult to get it right, IMO.

Do remember that the aim of POSIX is to specify what exists, and works
at least in some systems (what works in NetBSD does not have all that
much influence....)   That is, POSIX is not a legislature, their job is not to
tell us what we must do, but to document what works so the users can
write portable code.   Of course, they don't cater to every weirdness in
every system, so sometimes something is specified which is simply different
than we're used to.   Usually when that happens we just adapt - other
systems do it the other way, we want to be as compatible as possible (fewer
local patches) so ...  and sometimes we can convince them that they are
specifying sub-optimal behaviour, and even if they can't simply specify the
better way, they can at least allow it to co-exist.

  | Because one might interpret the reference to "XBD RE Bracket" as voiding
  | the quoting dance inside a bracket expression,

It isn't intended to do that.   The current new proposed text tries to fix it,
but it is not right yet.

  | since in a RE the special characters loose their special meaning,

They do in glob and fnmatch too - though different special characters
appear.   There is no question that [?] is a glob expression
(RE too, but there it is not a surprise) that matches a '?' and nothing else,
just as [.] is an RE (and glob) expression that matches a period.
Neither ever matches just any character.
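
This is easy to check in any shell.   A minimal sketch (the variable names
r1..r4 are just made up for the illustration):

```sh
# Inside [ ], both ? and . are ordinary characters.
case '?' in [?]) r1=match;; *) r1='no match';; esac	# a literal ?
case 'a' in [?]) r2=match;; *) r2='no match';; esac	# not "any character"
case '.' in [.]) r3=match;; *) r3='no match';; esac	# a literal period
case 'a' in [.]) r4=match;; *) r4='no match';; esac	# not "any character"
echo "$r1 / $r2 / $r3 / $r4"
```

Only the literal '?' and '.' should be reported as matching.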

  | and one could argue, it seems,
  | that this is the case for the double quotes too?

Yes, it would be, if you got the double quotes that far.   It all gets very
messy, as in something like ["${var}"] the quotes are not really
literal characters, they're recognised by the lexer, and used to
quote the characters in the expansion of ${var} ... POSIX actually
specifies that the quotes remain as is, until they are removed later,
but I think that's a bug - it is a way that appears nice to specify
how quoting works, but no shell I'm aware of actually implements it
like that, and it causes all kinds of weird (and incorrect) corner
cases, like the one you mentioned.

But if that were true, it would mean that

	case '"' in ["${var}"])  echo match;; esac

would echo "match" and I doubt anyone expects (and nothing
implements) that.
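
For anyone who wants to try it, a small sketch (the value given to var is
an arbitrary assumption, any value without a double quote in it will do):

```sh
var=abc		# arbitrary example value, contains no double quote

# If the quotes were literal members of the bracket expression,
# a lone double quote would match here.
case '"' in
["${var}"])	quote_result=match;;
*)		quote_result='no match';;
esac

# A character that really is in the expansion does match.
case 'b' in
["${var}"])	member_result=match;;
*)		member_result='no match';;
esac

echo "double quote: $quote_result, b: $member_result"
```

The shells I know of report no match for the double quote, and a match
for the b.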

Handling that kind of thing is why the first attempted fix simply
specified that quote removal happen on case patterns.   But
that broke all kinds of other things, so it became more complex
to try and fix them ...     But then that ignored dealing with things like
	ls '*'*
to list all files with names starting with an asterisk.   The spec is
quite clear that filename expansion happens before quote
removal, so the quotes are still there when the pattern is
matched against the filenames.    Just as they are in the
case above.    In this case it is easy to specify, as the glob
stuff that is not [ ] is quite different than REs, so it is all
specified separately, and it can just say "unquoted" and
stuff, and make it clear that the '*' just means a literal *.
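
A sketch of that, run in a scratch directory so it doesn't depend on
whatever files happen to be lying around.   It assumes mktemp -d is available
(it is not in POSIX, but most systems have it), and the file names are
invented for the example:

```sh
dir=$(mktemp -d) || exit 1	# scratch directory (mktemp -d assumed to exist)

listing=$(
	cd "$dir" || exit 1
	: > '*starred'		# name starts with a literal asterisk
	: > plain		# name does not
	# The quoted '*' is a literal asterisk, the unquoted * is the
	# meta-character, so only names starting with * are expanded.
	printf '%s\n' '*'*
)

rm -rf "$dir"
echo "$listing"
```

Only the file whose name starts with an asterisk should be listed.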

But with the way it is specified for [] using the RE spec for
glob expressions, it all gets ugly.

  | "When pattern matching is
  | used where shell quote removal is not performed..." 

Yes, that is part of the attempt to make the same text work for
all uses, it is not great...

I really would not waste too much time on this - everyone agrees that
what is in the currently published POSIX spec in this area is
incorrect.   The only debate relates to just what is the right way
to fix it, which will actually describe the way things work (which
actually, for something so badly specified, is fairly consistent amongst

  |  the POSIX wording,
  | as you have already explained, should be adjusted to the de facto
  | uses and not the reverse,

Yes, it should, and if we can work out how to do that in a way
everyone was happy with, we would.

Note that most of this is actually not all that hard (other than trying
to merge it all into the RE description) - but when you get to things like

	case "$x" in ${var}) ...

it starts getting very messy indeed.   If the ${var} there had been
quoted (as it was in the example that started all this) it is all easy -
the chars in the expansion are just literal chars, and that expression
would (with the quotes which are not there above) match if x is equal
to var (that is x='\??').
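
The easy (quoted) case, as a sketch, assuming var='\??' as above:

```sh
var='\??'

# Quoted expansion: every character of $var is literal, so only a
# string identical to $var matches.
x='\??'
case "$x" in "${var}") same=match;; *) same='no match';; esac

# A question mark followed by some other char is not equal to $var.
x='?a'
case "$x" in "${var}") other=match;; *) other='no match';; esac

echo "identical string: $same, '?a': $other"
```

Here only the identical string matches.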

But as it is written, what it means is less clear, though the most
common view is that the pattern ($var) matches a question mark
followed by any other character (that is, the \  quotes the first ?
and the second one is the meta-char).   However, it is also clear
that only \ works that way.   If we had var='"?"?' then it would match
a string starting with a double quote, then any single char, then
another double quote, then any char.   So something like "a"b would match.
On the other hand, if given literally
	case "$x" in "?"?) ...
then it matches a 2 char sequence starting with a question mark.
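
Putting those side by side, as a sketch (the test strings are made up for
the example).   What the \ case prints is exactly the unclear part, so no
particular result is claimed for it here:

```sh
# Quotes that come out of an expansion are just characters.
var='"?"?'
case '"a"b' in ${var}) expanded=match;; *) expanded='no match';; esac

# Quotes given literally in the pattern do quote: "?" is a literal ?.
case '"a"b' in "?"?) literal_long=match;; *) literal_long='no match';; esac
case '?b' in "?"?) literal_short=match;; *) literal_short='no match';; esac

# A backslash out of an unquoted expansion: the commonly held view is that
# it quotes the first ?, but see the discussion above.
var='\??'
case '?a' in ${var}) backslash=match;; *) backslash='no match';; esac

echo "expanded quotes vs \"a\"b: $expanded"
echo "literal quotes vs \"a\"b:  $literal_long"
echo "literal quotes vs ?b:     $literal_short"
echo "expanded backslash vs ?a: $backslash"
```

The first three should come out as match, no match, match.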

Sh syntax is just wonderful...   (in the original sense, full of wonder,
in that you stare at it, and wonder how did it get like that?).

