NetBSD-Bugs archive
bin/60275: sh(1): race condition in signal handling on background subshell fork
>Number: 60275
>Category: bin
>Synopsis: sh(1): race condition in signal handling on background subshell fork
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat May 16 23:25:00 +0000 2026
>Originator: Taylor R Campbell
>Release: 11, 9
>Organization:
The NetBSD Shell Corporation, Inc.
>Environment:
>Description:
I'm trying to run a shell script that:
1. manages some background jobs with job control,
2. has a trap handler to kill jobs if anything goes awry, and
3. has a timeout enforced by a watchdog timer child process
that sleeps and kills its parent.
When the parent has completed before the timeout, it kills the
watchdog timer child process before the child can kill the
parent:
	sleep 10 && echo timeout >&2 && kill $$ & timer=$!
	... do stuff ...
	kill $timer; wait $timer 2>/dev/null
(Yes, I could run the shell script itself under timeout(1)
instead of this gruesomely shakespearian familial death
struggle, but it's not always convenient to arrange that.)
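For comparison, a minimal sketch of the timeout(1) alternative mentioned in the parenthesis above. The `sleep 5' is only a hypothetical stand-in for a job that overruns its limit; nothing here comes from the script under discussion.

```shell
#!/bin/sh
# Sketch only: let timeout(1) enforce the limit instead of a watchdog child.
# `sleep 5' is a hypothetical stand-in for a job that runs too long.
timeout 1 sleep 5
status=$?

# A nonzero status here means the job did not complete within the limit.
echo "status=$status"
```

In real use the stand-in would be replaced by the script itself, which is exactly the arrangement that is not always convenient.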
However, sometimes, if the parent completes _too quickly_, the
child receives but discards the SIGTERM and proceeds to sleep
and kill the parent anyway, as shown by the following ktrace
output:
4168 4168 sh CALL fork
4168 4168 sh RET fork 17122/0x42e2
17122 17122 sh EMUL "netbsd"
17122 17122 sh RET fork 0
...
4168 4168 sh CALL kill(0x42e2, SIGTERM)
4168 4168 sh RET kill 0
4168 4168 sh CALL kill(0x42e2, SIGCONT)
17122 17122 sh PSIG SIGTERM caught handler=0x821ba80 mask=(): code=SI_USER sent by pid=4168, uid=1000)
... [nothing in pid 17122 elided here, only other processes] ...
17122 17122 sh CALL setcontext(0x7f7fffd1caa0)
17122 17122 sh RET setcontext JUSTRETURN
17122 17122 sh CALL __sigaction_sigtramp(SIGHUP,0x7f7fffd1cd70,0x7f7fffd1cd90,0x76999e0a1da0,2)
17122 17122 sh RET __sigaction_sigtramp 0
17122 17122 sh CALL __sigprocmask14(2,0x7f7fffd1cdc0,0)
17122 17122 sh RET __sigprocmask14 0
17122 17122 sh CALL __sigaction_sigtramp(SIGINT,0x7f7fffd1cd70,0x7f7fffd1cd90,0x76999e0a1da0,2)
17122 17122 sh RET __sigaction_sigtramp 0
17122 17122 sh CALL __sigprocmask14(2,0x7f7fffd1cdc0,0)
17122 17122 sh RET __sigprocmask14 0
...
17122 17122 sh CALL __vfork14
...
9116 9116 sh EMUL "netbsd"
9116 9116 sh RET fork 0
4168 4168 sh CALL dup2(3,2)
4168 4168 sh RET dup2 2
4168 4168 sh CALL close(3)
4168 4168 sh RET close 0
4168 4168 sh CALL __wait450(0xffffffff,0x7f7fffd1cc0c,0x12,0)
9116 9116 sh CALL execve(0x76999ec8c510,0x76999ec8c368,0x76999ec8c408)
9116 9116 sh NAMI "/bin/sleep"
If I insert a `sleep 0.01' between spawning and killing the
timer child, it works reliably -- haven't seen it fail yet.
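A possible alternative to the `sleep 0.01' is an explicit handshake, sketched below. It rests on an ASSUMPTION that the trace suggests but that has not been verified: that the child has finished adjusting its signal handlers by the time it runs its first command. The names start_watchdog, stop_watchdog and wd_dir are hypothetical, and this has not been confirmed to avoid the lost SIGTERM.

```shell
#!/bin/sh
# Sketch only, not a confirmed fix.  ASSUMPTION: once the watchdog
# subshell runs its first command, it no longer discards SIGTERM.

# start_watchdog SECS: start the watchdog, return once it is running.
start_watchdog() {
    wd_dir=$(mktemp -d) || return 1        # hypothetical scratch directory
    mkfifo "$wd_dir/ready" || return 1

    # The watchdog opens the FIFO for writing as its first command,
    # then sleeps and kills the parent as in the original one-liner.
    { : >"$wd_dir/ready"; sleep "$1" && echo timeout >&2 && kill $$; } &
    timer=$!

    # Opening a FIFO for reading blocks until a writer has opened it, so
    # cat returns only after the watchdog has reached its first command.
    cat "$wd_dir/ready" >/dev/null
    rm -rf "$wd_dir"
}

# stop_watchdog: kill the watchdog and reap it, ignoring its exit status.
stop_watchdog() {
    kill "$timer" 2>/dev/null
    wait "$timer" 2>/dev/null || :
}
```

Usage would be `start_watchdog 10', then the real work, then `stop_watchdog'. If the child dies before opening the FIFO, the parent blocks in cat forever, so this trades one failure mode for another.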
I skimmed some of the code in /bin/sh to find where the calls
to sigaction and sigprocmask were coming from, but it wasn't
obvious. Reproduced in 9 and in 11.
>How-To-Repeat:
$ cat >test.sh <<'EOF'
#!/bin/sh
set -m
status=123
trap '
case $? in 0);; *) status=$?;; esac
trap -
kill -9 %+ 2>/dev/null
wait %+ 2>/dev/null
exit $status
' ALRM EXIT HUP INT PIPE TERM
sleep 1 && echo timeout >&2 && kill $$ & timer=$!
#sleep 0.01
kill $timer; wait $timer 2>/dev/null || :
status=0
EOF
$ (i=0; while :; do if sh ./test.sh; then i=$((i + 1)); else echo status=$? i=$i; break; fi; done)
timeout
status=143 i=0
$
With the `sleep 0.01' uncommented, it takes thousands of iterations to fail.
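Since a rare failure can take thousands of iterations to show up, a bounded variant of the loop above may be handier than one that runs until the first failure. The function name run_until_fail and its MAX argument are hypothetical, not part of the original report.

```shell
#!/bin/sh
# run_until_fail MAX CMD...: run CMD up to MAX times, stopping at the
# first failure and reporting its exit status and the iteration count.
run_until_fail() {
    max=$1; shift
    i=0
    while [ "$i" -lt "$max" ]; do
        if "$@"; then
            i=$((i + 1))
        else
            # $? still holds the exit status of the failed command here.
            echo "status=$? i=$i"
            return 1
        fi
    done
    echo "no failure in $i iterations"
}
```

For example, `run_until_fail 10000 sh ./test.sh' either stops at the first failing run or gives up after 10000 successful ones.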
>Fix:
Yes, please!