Subject: bin/18738: tr(1) includes broken example
To: None <>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: netbsd-bugs
Date: 10/20/2002 08:42:21
>Number:         18738
>Category:       bin
>Synopsis:       tr(1) includes broken example
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    bin-bug-people
>State:          open
>Class:          doc-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Oct 20 05:43:00 PDT 2002
>Originator:     der Mouse
>Release:        -current
>Description:
	The tr(1) manpage includes an example

     Translate the contents of file1 to upper-case.

           tr "[:lower:]" "[:upper:]" < file1

	which is slightly broken in that it will misbehave if the
	character set in use has lowercase letters with no
	corresponding uppercase letter, or vice versa.  While ASCII
	does not have any such, one of the commonest non-ASCII
	character sets, ISO 8859-1, does - there is no uppercase
	version of 0xff (y with double-dot diacritic).  0xdf (German
	ss) might be another example, though I'm not sure.
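
	A rough way to check claims like this without relying on any
	particular system's locale data is sketched below in Python.
	It is only an approximation: it assumes Python's Unicode case
	tables as a stand-in for the C locale's ctype tables, which
	may classify some characters differently, and the function
	name unpaired_lowercase is made up for illustration.

```python
def unpaired_lowercase(codec):
    """Return byte values that are lowercase in `codec` but have no
    distinct single-byte uppercase form in the same character set."""
    unpaired = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this character set
        if not ch.islower():
            continue
        up = ch.upper()
        try:
            encoded = up.encode(codec)
        except UnicodeEncodeError:
            encoded = b""  # uppercase form lies outside the character set
        # Unpaired if uppercasing changes nothing, expands to several
        # characters, or cannot be represented in this character set.
        if up == ch or len(encoded) != 1:
            unpaired.append(b)
    return unpaired


for codec in ("ascii", "iso8859_1", "iso8859_7"):
    print(codec, [hex(b) for b in unpaired_lowercase(codec)])
```

	Under these assumptions the ASCII list comes out empty, while
	the 8859-1 list includes both 0xdf and 0xff.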

	8859-7 (Greek) is much more likely to be an example; it has at
	least three characteristics any one of which is liable to break
	that example:

	- I see no uppercase versions of 0xc0 or 0xe0 (iota and upsilon
	  with a diacritic I don't know any name for).

	- 0xd3 has two lowercase versions, 0xf2 and 0xf3 (sigma).

	- 0xb6, 0xb8, 0xb9, 0xba, 0xbc, 0xbe, and 0xbf are all
	  uppercase, and appear before the body of the uppercase
	  alphabet, but their corresponding lowercase versions, 0xdc,
	  0xdd, 0xde, 0xdf, 0xfc, 0xfd, and 0xfe, appear partly before
	  and partly after the body of the lowercase alphabet.  (These
	  are vowels carrying a tonos, the Greek accent mark, which
	  looks much like an acute accent.)
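
	The ordering problem can be sketched the same way.  The Python
	below assumes that tr pairs the Nth member of [:lower:] with
	the Nth member of [:upper:], each taken in codeset numeric
	order, and again uses Python's Unicode tables in place of real
	locale data; both are assumptions on my part, and the names
	case_classes and positional_mismatches are invented for the
	sketch.

```python
def case_classes(codec):
    """Return (lower, upper): byte values of each class in codeset order."""
    lower, upper = [], []
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this character set
        if ch.islower():
            lower.append(b)
        elif ch.isupper():
            upper.append(b)
    return lower, upper


def positional_mismatches(codec):
    """Return (lower_byte, upper_byte) pairs where pairing the classes
    position by position disagrees with the real case mapping."""
    lower, upper = case_classes(codec)
    bad = []
    for lo, up in zip(lower, upper):
        want = bytes([lo]).decode(codec).upper()
        got = bytes([up]).decode(codec)
        if want != got:
            bad.append((lo, up))
    return bad


for codec in ("ascii", "iso8859_1", "iso8859_7"):
    lower, upper = case_classes(codec)
    print(codec, len(lower), len(upper), len(positional_mismatches(codec)))
```

	For ASCII the two classes have the same length and every
	positional pair is a real case pair; for 8859-7 neither holds
	under these assumptions.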

	The manpage says that [:upper:] and [:lower:] are in "ascending
	order", but does not clearly indicate whether this means
	alphabetical order, codeset numeric order, or something else.
	However, as far as I can see no choice of order can finesse an
	issue like the two variants of lowercase sigma; the only way to
	handle that and still make things like the manpage example work
	would be to have [:upper:] include two copies of uppercase
	sigma.  And not even that helps any with 8859-1's 0xff or
	8859-7's 0xc0 and 0xe0, where the set simply doesn't have any
	corresponding uppercase character (perhaps because it doesn't
	exist; I'm not sure in any of those three cases whether there
	exists any uppercase version in the relevant languages).  I
	suppose you could decree that 8859-1 0xff and 8859-7 0xc0 and
	0xe0 are neither uppercase nor lowercase, but quite aside from
	violating least surprise, I don't think that could reasonably
	be done with the lowercase sigmas.
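
	To make the "two copies of uppercase sigma" point concrete,
	here is a sketch of what an [:upper:] aligned entry by entry
	with [:lower:] would have to contain.  It is Python, uses
	Python's Unicode case tables as an assumed stand-in for locale
	data, and upper_aligned_with_lower is a hypothetical name, not
	anything tr provides.

```python
def upper_aligned_with_lower(codec):
    """For each lowercase byte in codeset order, return a pair
    (lower_byte, upper_byte), with upper_byte None when the character
    set has no distinct single-byte uppercase form."""
    aligned = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this character set
        if not ch.islower():
            continue
        try:
            up = ch.upper().encode(codec)
        except UnicodeEncodeError:
            up = b""  # uppercase form lies outside the character set
        if len(up) == 1 and up[0] != b:
            aligned.append((b, up[0]))
        else:
            aligned.append((b, None))
    return aligned


for lo, up in upper_aligned_with_lower("iso8859_7"):
    print(hex(lo), "->", hex(up) if up is not None else "no uppercase")
```

	In the 8859-7 result 0xd3 shows up as the target of both 0xf2
	and 0xf3, and 0xc0 and 0xe0 have no target at all, so no
	duplicate-free [:upper:] can line up with [:lower:].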
>How-To-Repeat:
	Read the manpage.  Think about character sets.
>Fix:
	Removing the example is the simplest fix, but the most
	dangerous, because the note about ordering for [:lower:] and
	[:upper:] implies that something very much like that example
	could be expected to work.  I'd prefer to change the wording of
	the example, perhaps something like

	When using a character set with lowercase and uppercase
	versions of all letters appearing in the same order (such as
	ASCII, but not common non-ASCII sets like ISO 8859-1 or
	8859-7), a command such as

	tr "[:lower:]" "[:upper:]"

	can be used to translate data to upper case.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B