Subject: how to write secure internationalized shell scripts
To: None <tech-security@netbsd.org>
From: Bruno Haible <bruno@clisp.org>
List: tech-security
Date: 09/04/2003 17:55:09

Hi all,

Could some of you please tell me whether the proposed methodology for using
internationalization in shell scripts, based on GNU gettext, is safe enough?

The proposal for a hello-world program (that I want to incorporate in the
GNU gettext manual) looks like this:

===============================================================
#! /bin/sh

# Find a way to echo strings without interpreting backslash.
if test "X`(echo '\t') 2>/dev/null`" = 'X\t'; then
  echo='echo'
else
  if test "X`(printf '%s\n' '\t') 2>/dev/null`" = 'X\t'; then
    echo='printf %s\n'
  else
    echo_func () {
      cat <<EOT
$*
EOT
    }
    echo='echo_func'
  fi
fi

TEXTDOMAIN=hello
export TEXTDOMAIN
TEXTDOMAINDIR=/absolute/path/to/localedir
export TEXTDOMAINDIR

# Test whether the locale encoding is good or weird.
locale_charset_weird () {
  case `locale charmap | tr a-z A-Z` in
    BIG5 | BIG5-HKSCS | GBK | GB18030 | SHIFT_JIS | JOHAB) (exit 0);;
    *) (exit 1);;
  esac
}

use_backquote_workaround=
if locale_charset_weird; then
  s=`echo '()echo ()' | LC_ALL=C tr '(' '\340' | LC_ALL=C tr ')' '\140'`
  if eval echo "$s" | grep echo > /dev/null; then
    : # OK, the shell can recognize multibyte characters correctly.
  else
    # The shell can mistakenly interpret double-byte characters like \xe0\x60.
    use_backquote_workaround=yes
  fi
fi

if test -n "$use_backquote_workaround"; then
  eval_gettext () {
    _string=`gettext "$1" | LC_ALL=C tr -d '\177' | LC_ALL=C tr '\140' '\177'`
    eval _string="\"$_string\""
    $echo "$_string" | LC_ALL=C tr '\177' '\140'
  }
  eval_ngettext () {
    _string=`ngettext "$1" "$2" "$3" | LC_ALL=C tr -d '\177' | LC_ALL=C tr '\140' '\177'`
    eval _string="\"$_string\""
    $echo "$_string" | LC_ALL=C tr '\177' '\140'
  }
else
  eval_gettext () {
    _string=`gettext "$1"`
    eval _string="\"$_string\""
    $echo "$_string"
  }
  eval_ngettext () {
    _string=`ngettext "$1" "$2" "$3"`
    eval _string="\"$_string\""
    $echo "$_string"
  }
fi

# gettext can be used with literal strings without variables.
$echo "`gettext "Hello world"`"

# eval_gettext is for the cases where the string refers to variables.
$echo "`eval_gettext "Hello Mr. \\$USER, your terminal type is \\$TERM."`"

# eval_ngettext is for plural forms.
$echo "`eval_ngettext "a piece of cake" "\\$n pieces of cake" $n`"

===========================================================================
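
The backquote workaround in the script above can be looked at in isolation.
What follows is only a minimal sketch, not part of the proposal itself: the
helper names protect_backquotes and restore_backquotes are hypothetical and
the sample string is made up. It shows the byte swap (octal 140, the
backquote, replaced by octal 177, DEL) that keeps a stray 0x60 byte away
from eval.

```shell
#! /bin/sh
# Hypothetical helpers illustrating the byte swap used by the workaround.

# Drop any DEL bytes already present, then turn each backquote (\140)
# into DEL (\177).
protect_backquotes () {
  LC_ALL=C tr -d '\177' | LC_ALL=C tr '\140' '\177'
}

# Turn each DEL (\177) back into a backquote (\140).
restore_backquotes () {
  LC_ALL=C tr '\177' '\140'
}

sample='a`b'    # made-up input containing one backquote byte
protected=`printf '%s\n' "$sample" | protect_backquotes`
restored=`printf '%s\n' "$protected" | restore_backquotes`

# While protected, the string contains no backquote byte.
case $protected in
  *'`'*) echo "backquote still present" ;;
  *)     echo "no backquote while protected" ;;
esac

# After restoring, the original string is back.
printf '%s\n' "$restored"
```

Deleting DEL bytes first means that every DEL seen after the eval step must
have come from a backquote, so the reverse mapping is unambiguous.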

The idea is that a message catalog created by the translator contains, say,

   #, sh-evaluated
   msgid "Hello world"
   msgstr "Hallo Welt"

   #, sh-evaluated sh-format
   msgid "Hello Mr. $USER, your terminal type is $TERM."
   msgstr "Hallo Herr $USER, Ihr Terminal ist ein $TERM."

   #, sh-evaluated sh-format
   msgid "a piece of cake"
   msgid_plural "$n pieces of cake"
   msgstr[0] "ein Stück Kuchen"
   msgstr[1] "$n Stück Kuchen"

Such a message catalog is transformed to a .mo file by the msgfmt program.
The 'sh-format' marker is used by msgfmt: "msgfmt -c" verifies that the
translation (msgstr) refers only to those variables that the original string
(msgid) already refers to. The 'sh-evaluated' marker is used by msgfmt
as well: "msgfmt -c" verifies that the translation does not use dangerous
constructs like `...` or $(...).
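
To make the 'sh-evaluated' idea concrete, here is a rough approximation of
such a check written in shell. It is only a sketch under assumptions: the
function name msgstr_looks_safe is hypothetical, the sample strings are
made up, and this is not how msgfmt implements anything. It only looks for
the two constructs named above.

```shell
#! /bin/sh
# Hypothetical stand-in for the 'sh-evaluated' check; not msgfmt's code.
# Returns 0 if the string contains neither a backquote nor a "$(" sequence.
msgstr_looks_safe () {
  case $1 in
    *'`'* | *'$('*) return 1 ;;
    *) return 0 ;;
  esac
}

# Made-up translations: one harmless, two using command substitution.
for msgstr in 'Hallo Herr $USER' 'Hallo `id`' 'Hallo $(id)'; do
  if msgstr_looks_safe "$msgstr"; then
    printf 'accepted: %s\n' "$msgstr"
  else
    printf 'rejected: %s\n' "$msgstr"
  fi
done
```

A real check would probably have to consider more than these two patterns.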

The 'gettext' and 'ngettext' programs access this .mo file to extract
the translations and convert them to the current locale's encoding. Then
the shell script functions 'eval_gettext' or 'eval_ngettext' evaluate
the resulting string, to get the variables' values substituted into it.
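
The evaluation step can be tried without a message catalog. In the sketch
below, a fixed string stands in for the output of gettext, and the values
of USER and TERM are made up for the illustration; the eval line is the
same one used in eval_gettext.

```shell
#! /bin/sh
# Made-up values; in a real script these come from the environment.
USER=alice
TERM=xterm

# Stand-in for:
#   _string=`gettext 'Hello Mr. $USER, your terminal type is $TERM.'`
_string='Hallo Herr $USER, Ihr Terminal ist ein $TERM.'

# Same evaluation step as in eval_gettext: the shell re-parses the string
# inside double quotes, so $USER and $TERM are expanded.
eval _string="\"$_string\""
printf '%s\n' "$_string"
```

Because eval re-parses the whole string, a `...` or $(...) in the
translation would be executed at this point, which is what the
'sh-evaluated' check is meant to prevent.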

Do you see any security problems with this methodology?

                Bruno