Subject: Re: shell quoting problems
To: David Laight <david@l8s.co.uk>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 11/26/2002 14:15:46
[ On Tuesday, November 26, 2002 at 18:11:18 (+0000), David Laight wrote: ]
> Subject: shell quoting problems
>
> Under 'command substitution' is:
> 
> "A single-quoted or double-quoted string that begins, but does not
> end, within the "`...`" sequence produces undefined results."
> 
> From which one could infer that in "`... "abc" ...`" the characters
> abc are a quoted string, rather than being outside the ones that
> contain the ` characters.

Well, yes, sort of.  My understanding is that they are a quoted string
within the command expressed between the back-quotes.  The outer double
quotes are what make the output of that command into a quoted string
itself.

There is mention of some bogus back-quote handling in some older
implementations in the rationale section for quoting in IEEE P1003.2
Draft 11.2:

    Some systems have allowed the end of the word to terminate the backquoted
    command substitution, such as in

          "`echo hello"

    This usage is undefined in POSIX.2, where the matching backquote is
    required.  The other undefined usage can be illustrated by the example:

          sh -c '` echo "foo`'

    The description of the recursive actions involving command substitution
    can be illustrated with an example.  Upon recognizing the introduction of
    command substitution, the shell must parse input (in a new context),
    gathering the ``source'' for the command substitution until an unbalanced
    ) or ` is located.  For example, in the following

       echo "$(date; echo "
               one" )"

    the double-quote following the echo does not terminate the first double-
    quote; it is part of the command substitution ``script.''  Similarly, in

       echo "$(echo *)"

    the asterisk is not quoted since it is inside command substitution;
    however,

       echo "$(echo "*")"

    is quoted (and represents the asterisk character itself).
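
The last two examples from the rationale are easy to try for yourself.
What follows is only a sketch: the scratch directory and the file names
"one" and "two" are made up for the illustration, and it assumes a
mktemp that supports -d (which is common, but not something POSIX
promises).

```shell
# Work in a scratch directory so that * has something to match.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch one two

# The outer double quotes do not reach into $(...): the inner * is expanded.
unquoted=$(echo *)
# The inner double quotes belong to the inner command: * stays literal.
quoted=$(echo "*")

echo "$unquoted"
echo "$quoted"

cd / && rm -rf "$dir"
```

The first echo should list the two file names and the second should
print a bare asterisk.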


Here is the text from IEEE Std. 1003.1-2001, as presented in SUSv3,
which describes the shell command substitution rules:

    Command Substitution
    
   Command  substitution allows the output of a command to be substituted
   in  place of the command name itself. Command substitution shall occur
   when the command is enclosed as follows:
   
$(command)

   or (backquoted version):
   
`command`  

   The  shell  shall expand the command substitution by executing command
   in  a subshell environment (see Shell Execution Environment ) and
   replacing  the  command  substitution  (the  text  of command plus the
   enclosing  "$()"  or  backquotes)  with  the  standard  output  of the
   command,  removing  sequences  of one or more <newline>s at the end of
   the  substitution.  Embedded  <newline>s  before the end of the output
   shall not be removed; however, they may be treated as field delimiters
   and  eliminated  during field splitting, depending on the value of IFS
   and quoting that is in effect.
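
A small sketch of the newline rule, assuming the default IFS; the
variable names are mine and nothing here comes from the standard
itself:

```shell
# The command writes two lines followed by two extra empty lines.
out=$(printf 'first\nsecond\n\n\n')

# Only the trailing newlines are stripped; the embedded one survives.
expected='first
second'
[ "$out" = "$expected" ] && echo "trailing newlines gone, embedded one kept"

# Unquoted, the embedded newline acts as a field delimiter; quoted, it does not.
set -- $out;   unquoted_fields=$#
set -- "$out"; quoted_fields=$#
echo "$unquoted_fields $quoted_fields"
```

With the default IFS the unquoted expansion yields two fields and the
quoted one yields a single field.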
   
   Within  the  backquoted style of command substitution, backslash shall
   retain  its  literal  meaning, except when followed by: '$' , '`' , or
   '\'  (dollar  sign, backquote, backslash). The search for the matching
   backquote  shall  be  satisfied by the first backquote found without a
   preceding backslash; during this search, if a non-escaped backquote is
   encountered  within  a  shell  comment,  a  here-document, an embedded
   command  substitution  of  the  $(  command) form, or a quoted string,
   undefined  results occur. A single-quoted or double-quoted string that
   begins,  but  does  not  end,  within  the  "`...`"  sequence produces
   undefined results.
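
The backslash rule is the one that bites people most often.  As a
hedged illustration (the variable "name" is hypothetical), the same
inner text means different things in the two forms:

```shell
name=world

# Backquoted form: \$ loses its backslash before the inner command is
# parsed, so the inner command is "echo $name".
old=`echo \$name`

# $(...) form: the text is taken as-is, so the inner command is
# "echo \$name" and the dollar sign stays literal.
new=$(echo \$name)

echo "$old"
echo "$new"
```

The first line printed is the value of the variable; the second is the
literal text $name.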
   
   With   the  $(  command)  form,  all  characters  following  the  open
   parenthesis   to  the  matching  closing  parenthesis  constitute  the
   command.  Any  valid  shell  script  can be used for command, except a
   script  consisting  solely  of redirections which produces unspecified
   results.

   The results of command substitution shall not be processed for further
   tilde   expansion,   parameter  expansion,  command  substitution,  or
   arithmetic   expansion.   If  a  command  substitution  occurs  inside
   double-quotes,  it  shall  not  be  performed  on  the  results of the
   substitution.
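
To see that the output of a substitution is not expanded again, and
what the double quotes change, here is a sketch.  The helper "count"
is my own invention and simply reports how many arguments it was given;
the default IFS is assumed.

```shell
# Hypothetical helper: print the number of arguments received.
count() { echo "$#"; }

# The output contains things that look like expansions, but it is not rescanned.
literal=$(printf '%s' '$HOME `date` $((1+1))')
printf '%s\n' "$literal"

# Unquoted results are still subject to field splitting; quoted results are not.
split=$(count $(printf 'a b c'))
whole=$(count "$(printf 'a b c')")
echo "$split $whole"
```

The text comes back exactly as printf wrote it, and the counts are
three fields without the quotes and one field with them.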
   
   Command  substitution  can  be  nested.  To specify nesting within the
   backquoted version, the application shall precede the inner backquotes
   with backslashes, for example:
   
\`command\`
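
A minimal sketch of one level of nesting in both forms (the words
"outer" and "inner" are just placeholders):

```shell
# Backquoted form: the inner backquotes must be escaped.
old=`echo outer \`echo inner\``

# $(...) form: nesting needs no escaping at all.
new=$(echo outer $(echo inner))

echo "$old"
echo "$new"
```

Both lines print the same two words; the difference is only in how
much escaping the author has to keep track of.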

   If the command substitution consists of a single subshell, such as:
   
$( (command) )

   a  conforming  application  shall  separate  the "$(" and '(' into two
   tokens  (that is, separate them with white space). This is required to
   avoid any ambiguities with arithmetic expansion.
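
A sketch of the distinction, with variable names of my own choosing:

```shell
# "$(" then white space then "(": command substitution of a subshell.
sub=$( (cd / && pwd) )

# "$((" with no space: arithmetic expansion.
arith=$((1 + 2))

echo "$sub $arith"
```

The subshell keeps the cd from affecting the rest of the script, and
the white space keeps the parser from guessing at arithmetic.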




As for shell quoting, well, here is the authoritative text from IEEE
Std. 1003.1-2001, as presented in SUSv3 (which you should download for
yourself):

  Quoting
  
   Quoting is used to remove the special meaning of certain characters or
   words  to  the  shell.  Quoting  can  be  used to preserve the literal
   meaning  of  the  special  characters  in  the next paragraph, prevent
   reserved  words  from  being recognized as such, and prevent parameter
   expansion  and  command  substitution  within here-document processing
   (see Here-Document ).
   
   The  application  shall  quote the following characters if they are to
   represent themselves:
   
|  &  ;  <  >  (  )  $  `  \  "  '  <space>  <tab>  <newline>

   and  the  following may need to be quoted under certain circumstances.
   That  is,  these  characters  may  be  special depending on conditions
   described elsewhere in this volume of IEEE Std 1003.1-2001:
   
*   ?   [   #   ~   =   %

   The   various   quoting   mechanisms   are   the   escape   character,
   single-quotes, and double-quotes. The here-document represents another
   form of quoting; see Here-Document .
   
    Escape Character (Backslash)
    
   A backslash that is not quoted shall preserve the literal value of the
   following character, with the exception of a <newline>. If a <newline>
   follows  the  backslash,  the  shell  shall  interpret  this  as  line
   continuation.  The  backslash  and  <newline>s shall be removed before
   splitting  the  input  into  tokens.  Since  the  escaped <newline> is
   removed  entirely  from  the  input  and  is not replaced by any white
   space, it cannot serve as a token separator.
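
A short sketch of line continuation (the variable names are
hypothetical).  Note that the two halves are joined with nothing in
between, and that single quotes switch the behaviour off:

```shell
# Unquoted backslash-newline vanishes: this is the single word "foobar".
word=foo\
bar

# Inside single quotes the backslash and the newline are both kept.
kept='foo\
bar'

printf '%s\n' "$word"
printf '%s\n' "$kept"
```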

    Single-Quotes
    
   Enclosing  characters  in  single-quotes  (  ''  )  shall preserve the
   literal   value   of   each  character  within  the  single-quotes.  A
   single-quote cannot occur within single-quotes.
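
Since a single-quote cannot appear inside single-quotes, the usual
idiom is to close the string, add an escaped quote, and open it again.
A sketch, with made-up variable names:

```shell
# 'it' + \' + 's' are three adjacent pieces of one word.
msg='it'\''s'

# Everything else is literal inside single quotes.
lit='$HOME `date` \n'

printf '%s\n' "$msg"
printf '%s\n' "$lit"
```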
   
    Double-Quotes
    
   Enclosing  characters  in  double-quotes  (  ""  )  shall preserve the
   literal  value  of  all  characters within the double-quotes, with the
   exception  of the characters dollar sign, backquote, and backslash, as
   follows:
   $
          The  dollar  sign  shall retain its special meaning introducing
          parameter  expansion  (see Parameter Expansion ), a form of
          command  substitution  (see  Command  Substitution  ),  and
          arithmetic expansion (see Arithmetic Expansion ).
          The  input  characters  within  the quoted string that are also
          enclosed  between  "$("  and  the  matching  ')'  shall  not be
          affected  by  the  double-quotes,  but rather shall define that
          command  whose  output  replaces  the "$(...)" when the word is
          expanded.  The  tokenizing rules in Token Recognition , not
          including  the  alias substitutions in Alias Substitution ,
          shall be applied recursively to find the matching ')' .
          Within  the  string  of characters from an enclosed "${" to the
          matching  '}'  ,  an  even number of unescaped double-quotes or
          single-quotes,  if  any,  shall  occur.  A  preceding backslash
          character  shall  be  used to escape a literal '{' or '}' . The
          rule  in Parameter Expansion shall be used to determine the
          matching '}' .
   `
          The  backquote shall retain its special meaning introducing the
          other   form   of   command   substitution   (see   Command
          Substitution  ).  The  portion  of  the  quoted string from the
          initial  backquote  and the characters up to the next backquote
          that  is  not preceded by a backslash, having escape characters
          removed,  defines  that  command  whose output replaces "`...`"
          when  the  word  is  expanded.  Either  of  the following cases
          produces undefined results:
          + A single-quoted or double-quoted string that begins, but does
            not end, within the "`...`" sequence
          + A  "`...`" sequence that begins, but does not end, within the
            same double-quoted string
   \
          The  backslash  shall  retain  its special meaning as an escape
          character  (see  Escape  Character  (Backslash) ) only when
          followed  by  one  of  the following characters when considered
          special:
          
$   `   "   \   <newline>

   The  application  shall  ensure  that  a double-quote is preceded by a
   backslash  to  be included within double-quotes. The parameter '@' has
   special  meaning  inside double-quotes and is described in Special
   Parameters .
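
To illustrate the double-quote rules, here is a sketch with
hypothetical names.  The backslash is only an escape in front of the
listed characters; in front of anything else (the "a" below) it is
kept as an ordinary character:

```shell
var=value

# \$  \`  \"  \\ lose their backslash; \a keeps it; $var is expanded.
s="\$var \`x\` \"q\" \\ \a $var"
printf '%s\n' "$s"

# "$@" keeps each positional parameter as its own field.
set -- 'a b' c
n=0
for arg in "$@"; do n=$((n + 1)); done
echo "$n"
```

The string comes out with the escaped characters bare, the \a intact,
and the variable expanded; the loop sees two arguments, not three.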


  Token Recognition
  
   The  shell  shall read its input in terms of lines from a file, from a
   terminal  in the case of an interactive shell, or from a string in the
   case of sh -c or system(). The input lines can be of unlimited
   length.  These  lines  shall be parsed using two major modes: ordinary
   token recognition and processing of here-documents.
   
   When  an  io_here  token  has  been  recognized  by  the  grammar (see
   Shell  Grammar  ), one or more of the subsequent lines immediately
   following  the  next  NEWLINE  token  form  the  body  of  one or more
   here-documents   and  shall  be  parsed  according  to  the  rules  of
   Here-Document .
    
   When  it is not processing an io_here, the shell shall break its input
   into  tokens  by  applying the first applicable rule below to the next
   character  in  its input. The token shall be from the current position
   in  the input until a token is delimited according to one of the rules
   below;  the  characters  forming  the  token  are exactly those in the
   input,  including  any  quoting  characters. If it is indicated that a
   token  is  delimited, and no characters have been included in a token,
   processing shall continue until an actual token is delimited.
    1. If  the  end  of  input  is recognized, the current token shall be
       delimited.   If  there  is  no  current  token,  the  end-of-input
       indicator shall be returned as the token.
    2. If  the previous character was used as part of an operator and the
       current  character  is not quoted and can be used with the current
       characters  to  form an operator, it shall be used as part of that
       (operator) token.
    3. If  the previous character was used as part of an operator and the
       current  character  cannot  be used with the current characters to
       form  an  operator, the operator containing the previous character
       shall be delimited.
    4. If   the   current   character   is  backslash,  single-quote,  or
       double-quote  (  '\' , "'" , or '"'  ) and it is not quoted, it shall
       affect  quoting  for  subsequent  characters  up to the end of the
       quoted text. The rules for quoting are as described in Quoting
       .  During  token  recognition  no  substitutions shall be actually
       performed,   and  the  result  token  shall  contain  exactly  the
       characters   that  appear  in  the  input  (except  for  <newline>
       joining),  unmodified,  including any embedded or enclosing quotes
       or  substitution  operators, between the quote mark and the end of
       the  quoted  text.  The token shall not be delimited by the end of
       the quoted field.
    5. If  the  current  character  is an unquoted '$' or '`' , the shell
       shall identify the start of any candidates for parameter expansion
       (  Parameter  Expansion  ), command substitution ( Command
       Substitution ), or arithmetic expansion ( Arithmetic Expansion
       )  from  their  introductory  unquoted character sequences: '$' or
       "${"  ,  "$("  or  '`' , and "$((" , respectively. The shell shall
       read  sufficient  input  to  determine  the  end of the unit to be
       expanded  (as  explained  in the cited sections). While processing
       the  characters,  if  instances of expansions or quoting are found
       nested  within  the  substitution,  the  shell  shall  recursively
       process  them  in  the  manner specified for the construct that is
       found. The characters found from the beginning of the substitution
       to  its  end,  allowing  for  any recursion necessary to recognize
       embedded  constructs,  shall  be included unmodified in the result
       token,  including any embedded or enclosing substitution operators
       or  quotes.  The  token  shall  not be delimited by the end of the
       substitution.
    6. If  the  current  character  is  not quoted and can be used as the
       first  character  of  a  new  operator, the current token (if any)
       shall  be  delimited.  The  current character shall be used as the
       beginning of the next (operator) token.
    7. If  the  current  character  is an unquoted <newline>, the current
       token shall be delimited.
    8. If  the  current  character  is  an  unquoted  <blank>,  any token
       containing  the  previous  character  is delimited and the current
       character shall be discarded.
    9. If  the  previous  character  was  part  of  a  word,  the current
       character shall be appended to that word.
   10. If  the  current  character  is  a  '#'  ,  it  and all subsequent
       characters  up  to,  but  excluding,  the  next <newline> shall be
       discarded  as  a  comment. The <newline> that ends the line is not
       considered part of the comment.
   11. The current character is used as the start of a new word.

   Once  a  token  is  delimited,  it  is  categorized as required by the
   grammar in Shell Grammar .
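
A few of the token rules can be seen directly.  This is only a sketch:
the helper "count" is my own, and eval is used simply so that the
comment cannot swallow the closing parenthesis of the substitution.

```shell
# Hypothetical helper: print the number of arguments received.
count() { echo "$#"; }

# Rule 9: '#' in the middle of a word is just another character.
mid=$(eval 'echo one#two')

# Rule 10: '#' at the start of a word begins a comment.
start=$(eval 'echo one #two')

# Rule 4: the end of a quoted field does not delimit the token,
# so the three adjacent pieces below form one word.
words=$(count "a b"c'd e')

echo "$mid"
echo "$start"
echo "$words"
```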



IMNSHO the back-quote form of command-substitution should have been
deprecated LONG ago!

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>