NetBSD-Bugs archive


bin/60275: sh(1): race condition in signal handling on background subshell fork



>Number:         60275
>Category:       bin
>Synopsis:       sh(1): race condition in signal handling on background subshell fork
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat May 16 23:25:00 +0000 2026
>Originator:     Taylor R Campbell
>Release:        11, 9
>Organization:
The NetBSD Shell Corporation, Inc.
>Environment:
>Description:

	I'm trying to run a shell script that:

	1. manages some background jobs with job control,
	2. has a trap handler to kill jobs if anything goes awry, and
	3. has a timeout enforced by a watchdog timer child process
	   that sleeps and kills its parent.

	When the parent has completed before the timeout, it kills the
	watchdog timer child process before the child can kill the
	parent:

		sleep 10 && echo timeout >&2 && kill $$ & timer=$!
		... do stuff ...
		kill $timer; wait $timer 2>/dev/null

	(Yes, I could run the shell script itself under timeout(1)
	instead of this gruesomely shakespearian familial death
	struggle, but it's not always convenient to arrange that.)

	However, sometimes, if the parent completes _too quickly_, the
	child receives but discards the SIGTERM and proceeds to sleep
	and kill the parent anyway, as shown by the following ktrace
	output:

  4168   4168 sh       CALL  fork
  4168   4168 sh       RET   fork 17122/0x42e2
 17122  17122 sh       EMUL  "netbsd"
 17122  17122 sh       RET   fork 0
...
  4168   4168 sh       CALL  kill(0x42e2, SIGTERM)
  4168   4168 sh       RET   kill 0
  4168   4168 sh       CALL  kill(0x42e2, SIGCONT)
 17122  17122 sh       PSIG  SIGTERM caught handler=0x821ba80 mask=(): code=SI_USER sent by pid=4168, uid=1000)
... [nothing in pid 17122 elided here, only other processes] ...
 17122  17122 sh       CALL  setcontext(0x7f7fffd1caa0)
 17122  17122 sh       RET   setcontext JUSTRETURN
 17122  17122 sh       CALL  __sigaction_sigtramp(SIGHUP,0x7f7fffd1cd70,0x7f7fffd1cd90,0x76999e0a1da0,2)
 17122  17122 sh       RET   __sigaction_sigtramp 0
 17122  17122 sh       CALL  __sigprocmask14(2,0x7f7fffd1cdc0,0)
 17122  17122 sh       RET   __sigprocmask14 0
 17122  17122 sh       CALL  __sigaction_sigtramp(SIGINT,0x7f7fffd1cd70,0x7f7fffd1cd90,0x76999e0a1da0,2)
 17122  17122 sh       RET   __sigaction_sigtramp 0
 17122  17122 sh       CALL  __sigprocmask14(2,0x7f7fffd1cdc0,0)
 17122  17122 sh       RET   __sigprocmask14 0
...
 17122  17122 sh       CALL  __vfork14
...
  9116   9116 sh       EMUL  "netbsd"
  9116   9116 sh       RET   fork 0
  4168   4168 sh       CALL  dup2(3,2)
  4168   4168 sh       RET   dup2 2
  4168   4168 sh       CALL  close(3)
  4168   4168 sh       RET   close 0
  4168   4168 sh       CALL  __wait450(0xffffffff,0x7f7fffd1cc0c,0x12,0)
  9116   9116 sh       CALL  execve(0x76999ec8c510,0x76999ec8c368,0x76999ec8c408) 
  9116   9116 sh       NAMI  "/bin/sleep"

	If I insert a `sleep 0.01' between spawning and killing the
	timer child, it works reliably -- haven't seen it fail yet.
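
	A deterministic stand-in for the `sleep 0.01' might be a
	handshake: the child opens a FIFO as its first command, and the
	parent blocks on that FIFO before sending the kill.  This is
	only a sketch.  It assumes the child has finished resetting its
	inherited signal handlers by the time it runs its first command,
	which the ktrace above suggests but which I have not confirmed
	in the sh source.  The FIFO path and the names `dir', `fifo',
	and `ready' are made up for illustration.

```shell
#!/bin/sh
# Hypothetical workaround sketch: wait until the watchdog child is
# running commands before killing it, instead of sleeping 0.01s.

dir=$(mktemp -d) || exit 1
fifo=$dir/ready
mkfifo "$fifo" || exit 1

{
	# Opening the FIFO for writing blocks until the parent opens it
	# for reading, so this doubles as the "child is ready" signal.
	: >"$fifo"
	sleep 10 && echo timeout >&2 && kill $$
} & timer=$!

# Blocks until the child has opened the FIFO.  The child writes
# nothing, so read sees EOF and returns nonzero; ignore that.
read -r ready <"$fifo" || :
rm -rf "$dir"

# ... do stuff ...

# Assumption: by now the child has its own signal dispositions, so
# the SIGTERM is acted on rather than discarded.
kill $timer; wait $timer 2>/dev/null || :
```

	This narrows the window rather than fixing the underlying
	race in sh, and a sleep(1) that has already been started by
	the child may outlive it.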

	I skimmed some of the code in /bin/sh to find where the calls
	to sigaction and sigprocmask were coming from, but it wasn't
	obvious.  Reproduced in 9 and in 11.

>How-To-Repeat:

$ cat >test.sh <<'EOF'
#!/bin/sh

set -m
status=123
trap '
	case $? in 0);; *) status=$?;; esac
	trap -
	kill -9 %+ 2>/dev/null
	wait %+ 2>/dev/null
	exit $status
' ALRM EXIT HUP INT PIPE TERM
sleep 1 && echo timeout >&2 && kill $$ & timer=$!
#sleep 0.01
kill $timer; wait $timer 2>/dev/null || :
status=0
EOF
$ (i=0; while :; do if sh ./test.sh; then i=$((i + 1)); else echo status=$? i=$i; break; fi; done)
timeout
status=143 i=0
$ 
	With the `sleep 0.01' uncommented, it takes thousands of
	iterations to fail.

>Fix:

	Yes, please!



