NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: bin/59395: sh mangles some UTF-8 input
The following reply was made to PR bin/59395; it has been noted by GNATS.
From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: bin/59395: sh mangles some UTF-8 input
Date: Tue, 06 May 2025 05:08:27 +0700
Date: Sun, 4 May 2025 15:25:00 +0000 (UTC)
From: campbell+netbsd%mumble.net@localhost
Message-ID: <20250504152500.D69BC1A923F%mollari.NetBSD.org@localhost>
OK, I understand this now.
| I tried to reduce this test case further and the problem went away.
It would - the problem exists only with here docs that are (internally,
including extra added escape chars) > 1024 bytes long.
[Note that the LC_CTYPE setting is irrelevant however, that rarely
affects anything sh does - it mostly treats text as simply sequences
of bytes to be moved from place to place, and certainly that's all
that is done with here-docs, sh doesn't care in the slightest what
the char encoding looks like in there, provided there are no \0 bytes
involved anywhere. ]
The problem is caused by an optimisation (an ancient one) where when
reading and processing a here document (which was, at the time,
expanded in parallel with sending the here doc to the relevant
command's stdin), whenever the internal string buffer needed to be
grown, and if at least 1024 bytes had already been collected, then
rather than malloc'ing more memory, the code simply wrote the
accumulated string down the pipe to the command, and then reused
the space which had held that initial segment of the here doc.
The bug is that when it did that, the shell's internal escape chars
(0x81, aka 0201) were not removed from the string first.
Eventually whatever remains of the here doc text (all of it whenever
it is less than 1KB) gets any escape characters removed, and then
the result of that is sent to the pipe (which is then closed, and the
here-doc is done).
This bug has never been fixed.
However, when PR bin/53550 was fixed, the optimisation (which remains)
was made inert, as its preconditions aren't met. To fix that bug,
the here doc needed to be expanded earlier, the pipe to the command
which is to use it hasn't been set up yet (the shell has not forked
to allow that command to be run), so the optimisation simply doesn't
occur, the whole here doc text simply accumulates in memory, and is
processed as the "whatever remains" (now the whole thing) above, and
all is good.
The fix for PR bin/53550 was never pulled up, and I don't want to
attempt to do that, it was a messy change, which affected code all
over the place, and most likely (though I haven't tried it) wouldn't
apply cleanly to the -9 sources.
However, the optimisation itself remains in the code in HEAD, even
though that can never be used any more - I can easily simply remove
that, and then pull up that change to -9 (only -9, -10 doesn't need it,
it already has the 53550 fix which was done back in Nov 2021, well
before -10 was branched). That is all that is needed to fix
this bug in -9, I have tested it.
An alternative would be to add the extra line of code needed to
remove the internal escape bytes, before doing the write to the pipe.
I doubt that is the right way to go -- it would be absurd to attempt
as a change to HEAD and a regular pullup, it would need to be
done as simply a patch to -9 - and an ugly one, as the function
which removes the escapes is a static in one file (expand.c) and
the optimisation which would need to use it is in a different
file (memalloc.c), so to do it this way that routine would need to
be given global scope, moving its decl into a header file, which would
then need be included in memalloc.c ... easy to do, but messy.
Once that's done, assuming we do it the first way, there are some more
related cleanups related to all of this which can be done as well (getting
rid of more dead code), which will simply make the code a little cleaner,
which I will do for HEAD (no need to pull any of that up anywhere).
kre
Home |
Main Index |
Thread Index |
Old Index