Re: bin/59395: sh mangles some UTF-8 input

To: gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,campbell+netbsd%mumble.net@localhost
Subject: Re: bin/59395: sh mangles some UTF-8 input
From: "Robert Elz via gnats" <gnats-admin%NetBSD.org@localhost>
Date: Mon, 5 May 2025 22:10:01 +0000 (UTC)

The following reply was made to PR bin/59395; it has been noted by GNATS.

From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: bin/59395: sh mangles some UTF-8 input
Date: Tue, 06 May 2025 05:08:27 +0700

     Date:        Sun,  4 May 2025 15:25:00 +0000 (UTC)
     From:        campbell+netbsd%mumble.net@localhost
     Message-ID:  <20250504152500.D69BC1A923F%mollari.NetBSD.org@localhost>

 OK, I understand this now.

   | I tried to reduce this test case further and the problem went away.

 It would - the problem exists only with here docs that are (internally,
 including extra added escape chars) > 1024 bytes long.

 [Note that the LC_CTYPE setting is irrelevant however, that rarely
  affects anything sh does - it mostly treats text as simply  sequences
  of bytes to be moved from place to place, and certainly  that's all
  that is done with here-docs, sh doesn't care in the  slightest what
  the char encoding looks like in there, provided there  are no \0 bytes
  involved anywhere. ]

 The problem is caused by an optimisation (an ancient one) where when
 reading and processing a here document (which was, at the time,
 expanded in parallel with sending the here doc to the relevant
 command's stdin), whenever the internal string buffer needed to be
 grown, and if at least 1024 bytes had already been collected, then
 rather than malloc'ing more memory, the code simply wrote the
 accumulated string down the pipe to the command, and then reused
 the space which had held that initial segment of the here doc.

 The bug is that when it did that, the shell's internal escape chars
 (0x81, aka 0201) were not removed from the string first.

 Eventually whatever remains of the here doc text (all of it whenever
 it is less than 1KB) gets any escape characters removed, and then
 the result of that is sent to the pipe (which is then closed, and the
 here-doc is done).

 This bug has never been fixed.

 However, when PR bin/53550 was fixed, the optimisation (which remains)
 was made inert, as its preconditions aren't met.  To fix that bug,
 the here doc needed to be expanded earlier, the pipe to the command
 which is to use it hasn't been set up yet (the shell has not forked
 to allow that command to be run), so the optimisation simply doesn't
 occur, the whole here doc text simply accumulates in memory, and is
 processed as the "whatever remains" (now the whole thing) above, and
 all is good.

 The fix for PR bin/53550 was never pulled up, and I don't want to
 attempt to do that, it was a messy change, which affected code all
 over the place, and most likely (though I haven't tried it) wouldn't
 apply cleanly to the -9 sources.

 However, the optimisation itself remains in the code in HEAD, even
 though that can never be used any more - I can easily simply remove
 that, and then pull up that change to -9 (only -9, -10 doesn't need it,
 it already has the 53550 fix which was done back in Nov 2021, well
 before -10 was branched).   That is all that is needed to fix
 this bug in -9, I have tested it.

 An alternative would be to add the extra line of code needed to
 remove the internal escape bytes, before doing the write to the pipe.
 I doubt that is the right way to go -- it would be absurd to attempt
 as a change to HEAD and a regular pullup, it would need to be
 done as simply a patch to -9 - and an ugly one, as the function
 which removes the escapes is a static in one file (expand.c) and
 the optimisation which would need to use it is in a different
 file (memalloc.c), so to do it this way that routine would need to
 be given global scope, moving its decl into a header file, which would
 then need be included in memalloc.c ... easy to do, but messy.

 Once that's done, assuming we do it the first way, there are some more
 related cleanups related to all of this which can be done as well (getting
 rid of more dead code), which will simply make the code a little cleaner,
 which I will do for HEAD (no need to pull any of that up anywhere).

 kre

Prev by Date: NetBSD Nightly Trouble Ticket Report
Next by Date: Re: bin/59395 (sh mangles some UTF-8 input)
Previous by Thread: Re: bin/59395: sh mangles some UTF-8 input
Next by Thread: Re: port-vax/59326 pointer to global function in same object file points PLT entry
Indexes:

Home | Main Index | Thread Index | Old Index