NetBSD-Bugs archive
kern/58091: after fork/execve or posix_spawn, parent kill(child, SIGTERM) has race condition making it unreliable
>Number: 58091
>Category: kern
>Synopsis: after fork/execve or posix_spawn, parent kill(child, SIGTERM) has race condition making it unreliable
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Mar 30 13:35:00 +0000 2024
>Originator: Michael Dusan
>Release:
>Organization:
Zig Software Foundation
>Environment:
NetBSD netbsd100-amd64 10.0_RC6 NetBSD 10.0_RC6 (GENERIC) #0: Tue Mar 12 10:19:02 UTC 2024 mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64
NetBSD netbsd93-amd64 9.3 NetBSD 9.3 (GENERIC) #0: Thu Aug 4 15:30:37 UTC 2022 mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Fork/exec a child and, as the parent's first action, send SIGTERM to the child. Roughly 3 times out of a million, the signal is never received by the child.
The variant using posix_spawn tends to manifest much more frequently on NetBSD 10.0_RC6, and more frequently on NetBSD 9.3.
Unable to reproduce this bug on Arch Linux, macOS 14.0, FreeBSD 14.4, OpenBSD 7.4, DragonFly 6.4.
Using ktrace, I was able to see the bug much more frequently (with the Zig code that motivated this bug report). I observed that the bug manifests when the parent's `kill()` call appears close to the child's `execve()` call in the ktrace output, i.e. immediately preceding it.
It seems that the signal is lost somewhere in kernel execve preparation.
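One way to probe the timing observation above is to make the gap between `fork()` and `kill()` adjustable. The following is only a sketch and not part of the original report: `run_trial` and its `delay_us` parameter are hypothetical names, and the idea that a longer delay avoids the loss is an untested assumption.

```c
#include <signal.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical probe (not from the original report): fork/exec a child,
 * wait delay_us microseconds, then send SIGTERM.
 * Returns 1 if the child was terminated by a signal, 0 if it ended any
 * other way (the "whups" case), and -1 if a system call failed. */
static int run_trial(unsigned int delay_us) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        char *argv[] = { "sleep", "10", NULL };
        char *envp[] = { NULL };
        execve("/bin/sleep", argv, envp);
        _exit(127); /* reached only if execve() failed */
    }
    if (delay_us > 0)
        usleep(delay_us); /* widen the gap between fork() and kill() */
    if (kill(pid, SIGTERM) == -1)
        return -1;
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? 1 : 0;
}
```

Calling `run_trial(0)` in a loop corresponds to the reported scenario. If the count of 0 results drops as `delay_us` grows, that would be consistent with the loss being tied to the execve window, though it would not prove it.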
>How-To-Repeat:
0. caution: running this bug may hose the system. In another incarnation it would end my ssh session (and other sessions to the same NetBSD system), requiring a reboot
1. see affixed bug.c code
2. cc -o bug bug.c
3. in shell `repeat 1000 ./bug`
4. over time, the output "whups" indicates child did not end due to signal
5. it sometimes helps to keep the system busy, e.g. concurrently run step #3 in another shell
6. I usually observe 2 or 3 "whups" per invocation
7. testing env 1: qemu VM netbsd 10.0_RC6 as "8 core" guest
8. testing env 2: qemu VM netbsd 9.3 amd64 as "8 core" guest
9. VM host: archlinux, AMD Ryzen 9 7900X 12-Core Processor
///////////////////////////////////////////////////////////////////////////////
// bug.c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
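void doit(void) {
    pid_t pid = fork();
    if (pid == -1) {
        // never fall through: kill(-1, SIGTERM) would signal every
        // process we are allowed to signal
        fprintf(stderr, "fork: errno=%d\n", errno);
        return;
    }
    if (pid == 0) {
        // we are child
        char *argv[] = { "sleep", "10", NULL };
        execve("/bin/sleep", argv, NULL);
        // only reached if execve failed; exit so the child does not
        // return into the parent's loop
        _exit(127);
    } else {
        // we are parent
        if (kill(pid, SIGTERM) == -1) {
            fprintf(stderr, "kill: errno=%d\n", errno);
            return;
        }
        int status;
        if (waitpid(pid, &status, 0) == -1) {
            fprintf(stderr, "waitpid: errno=%d\n", errno);
            return;
        }
        // child was expected to die from SIGTERM
        if (!WIFSIGNALED(status)) {
            fprintf(stderr, "whups!\n");
        }
    }
}

int main(void) {
    for (int i = 0; i < 1000; i++) {
        doit();
    }
    return 0;
}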
///////////////////////////////////////////////////////////////////////////////
// bug_posix.c
// this variant uses `posix_spawn()` instead of fork/execve
// here it's set to do 1 million iterations
// netbsd 10.0_RC3 emits "whups" over a hundred times on average
// netbsd 9.3 emits "whups" maybe 20 times on average
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <spawn.h>
#include <sys/wait.h>
void doit(void) {
    char *argv[] = { "sleep", "1", NULL };
    pid_t pid;
    // posix_spawn returns an error number on failure; it does not
    // return -1 and is not specified to set errno
    int err = posix_spawn(&pid, "/bin/sleep", NULL, NULL, argv, NULL);
    if (err != 0) {
        fprintf(stderr, "posix_spawn: error=%d\n", err);
        return;
    }
    if (kill(pid, SIGTERM) == -1) {
        fprintf(stderr, "kill: errno=%d\n", errno);
        return;
    }
    int status;
    if (waitpid(pid, &status, 0) == -1) {
        fprintf(stderr, "waitpid: errno=%d\n", errno);
        return;
    }
    // child was expected to die from SIGTERM
    if (!WIFSIGNALED(status)) {
        fprintf(stderr, "whups!\n");
    }
}

int main(void) {
    for (int i = 0; i < 1000000; i++) {
        doit();
    }
    return 0;
}
>Fix: