kern/58091: after fork/execve or posix_spawn, parent kill(child, SIGTERM) has race condition making it unreliable

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/58091: after fork/execve or posix_spawn, parent kill(child, SIGTERM) has race condition making it unreliable
From: michael.dusan%gmail.com@localhost
Date: Sat, 30 Mar 2024 13:35:00 +0000 (UTC)

>Number:         58091
>Category:       kern
>Synopsis:       after fork/execve or posix_spawn, parent kill(child, SIGTERM) has race condition making it unreliable
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Mar 30 13:35:00 +0000 2024
>Originator:     Michael Dusan
>Release:        
>Organization:
Zig Software Foundation
>Environment:
NetBSD netbsd100-amd64 10.0_RC6 NetBSD 10.0_RC6 (GENERIC) #0: Tue Mar 12 10:19:02 UTC 2024  mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64

NetBSD netbsd93-amd64 9.3 NetBSD 9.3 (GENERIC) #0: Thu Aug  4 15:30:37 UTC 2022  mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
Fork/exec a child, then have the parent immediately send SIGTERM to it. Roughly 3 times in a million, the child never receives the signal.

The variant using posix_spawn manifests much more frequently on NetBSD 10.0_RC6, and more frequently on NetBSD 9.3.

Unable to reproduce this bug on Arch Linux, macOS 14.0, FreeBSD 14.4, OpenBSD 7.4, or DragonFly 6.4.

Using ktrace, I was able to see the bug much more frequently (with the Zig code that motivated this report), and observed that the bug manifests when the parent's `kill()` call appears in the ktrace output immediately before the child's `execve()` call.

It seems that the signal is lost somewhere in kernel execve preparation.
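
One way to probe this hypothesis is a control experiment in which the parent delays kill() until the child's exec has progressed far enough to close its close-on-exec descriptors. The sketch below is not part of the original reproducer; the helper name spawn_kill_after_exec is hypothetical, and it assumes that EOF on the pipe is a usable proxy for "exec has finished its preparation", which may not hold exactly for every kernel.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork/exec `path`, wait for the exec to close a
 * close-on-exec pipe, then send `sig` and reap the child.
 * Returns 0 and stores the wait status in *status, or -1 on error.
 */
int spawn_kill_after_exec(const char *path, char *const argv[], int sig,
                          int *status) {
    int fds[2];
    if (pipe(fds) == -1)
        return -1;
    /* A successful execve() closes the write end automatically. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char *envp[] = { NULL };
        close(fds[0]);
        execve(path, argv, envp);
        /* exec failed: tell the parent by writing errno to the pipe */
        int e = errno;
        ssize_t w = write(fds[1], &e, sizeof e);
        (void)w;
        _exit(127);
    }

    close(fds[1]);
    int child_errno;
    ssize_t n;
    do {
        /* Returns 0 (EOF) once the child's write end is closed by exec. */
        n = read(fds[0], &child_errno, sizeof child_errno);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);

    if (n != 0) {
        /* exec failed (or read error): reap the child and give up */
        waitpid(pid, status, 0);
        return -1;
    }
    if (kill(pid, sig) == -1) {
        /* do not leave the child running if the signal could not be sent */
        kill(pid, SIGKILL);
        waitpid(pid, status, 0);
        return -1;
    }
    return waitpid(pid, status, 0) == -1 ? -1 : 0;
}
```

If the signal is only ever lost while it arrives during exec preparation, a loop around this helper should never see a child that exits normally. If it still does, the window is somewhere else.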

>How-To-Repeat:
0. Caution: running this reproducer may hose the system. In another incarnation it would end my ssh session (and other sessions to the same NetBSD system), requiring a reboot.
1. see affixed bug.c code
2. cc -o bug bug.c
3. in shell `repeat 1000 ./bug`
4. over time, the output "whups" indicates child did not end due to signal
5. it sometimes helps to keep the system busy, e.g. by concurrently running step #3 in another shell
6. I usually observe 2 or 3 "whups" per invocation
7. testing env 1: qemu VM netbsd 10.0_RC6 as "8 core" guest
8. testing env 2: qemu VM netbsd 9.3 amd64 as "8 core" guest
9. VM host: archlinux, AMD Ryzen 9 7900X 12-Core Processor

///////////////////////////////////////////////////////////////////////////////
// bug.c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

void doit() {
    pid_t pid = fork();
    if (pid == 0) {
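        char *argv[] = { "sleep", "10", NULL };
        execve("/bin/sleep", argv, NULL);
        // execve only returns on failure; exit so the child does not
        // fall through and continue the parent's loop
        _exit(127);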
    } else {
        // we are parent
        if (kill(pid, SIGTERM) == -1) {
            fprintf(stderr, "kill: errno=%d\n", errno);
            return;
        }
        int status;
        if (waitpid(pid, &status, 0) == -1) {
            fprintf(stderr, "waitpid: errno=%d\n", errno);
            return;
        }
        if (!WIFSIGNALED(status)) {
            fprintf(stderr, "whups!\n");
        }
    }
}

int main() {
    for (int i = 0; i < 1000; i++) {
        doit();
    }
}


///////////////////////////////////////////////////////////////////////////////
// bug_posix.c
// this variant uses `posix_spawn()` instead of fork/execve
// here it's set to do 1 million iterations
// netbsd 10.0_RC3 emits "whups" over a hundred times on average
// netbsd 9.3 emits "whups" maybe 20 times on average

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <spawn.h>
#include <sys/wait.h>

void doit() {
    char *argv[] = { "sleep", "1", NULL };
    pid_t pid;
    // posix_spawn returns an error number on failure and does not set errno
    int err = posix_spawn(&pid, "/bin/sleep", NULL, NULL, argv, NULL);
    if (err != 0) {
        fprintf(stderr, "posix_spawn: error=%d\n", err);
        return;
    }

    if (kill(pid, SIGTERM) == -1) {
        fprintf(stderr, "kill: errno=%d\n", errno);
        return;
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        fprintf(stderr, "waitpid: errno=%d\n", errno);
        return;
    }
    if (!WIFSIGNALED(status)) {
        fprintf(stderr, "whups!\n");
    }
}

int main() {
    for (int i = 0; i < 1000000; i++) {
        doit();
    }
}
>Fix:
