lib/56979: fork(2) fails to be signal safe

To: lib-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: lib/56979: fork(2) fails to be signal safe
From: tgl%sss.pgh.pa.us@localhost
Date: Thu, 25 Aug 2022 16:30:00 +0000 (UTC)

>Number:         56979
>Category:       lib
>Synopsis:       fork(2) fails to be signal safe
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    lib-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Aug 25 16:30:00 +0000 2022
>Originator:     Tom Lane
>Release:        HEAD/202208100950Z
>Organization:
PostgreSQL Global Development Group
>Environment:
NetBSD cube.sss.pgh.pa.us 9.99.99 NetBSD 9.99.99 (GENERIC) #0: Wed Aug 10 08:38:43 UTC 2022  mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/macppc/compile/GENERIC macppc
>Description:
I tracked down a rare deadlock that occurs while starting a Postgres server on NetBSD-current.
The stack trace is:

(gdb) bt
#0  0xfdeede98 in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1  0xfdee3d48 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2  0xfdee5dac in __locked_fork () from /usr/libexec/ld.elf_so
#3  0xfdd0650c in fork () from /usr/lib/libc.so.12
#4  0x01c1bd18 in fork_process () at fork_process.c:59
#5  0x01c1fbb0 in StartChildProcess (type=WalWriterProcess)
    at postmaster.c:5398
#6  0x01c212b8 in reaper (postgres_signal_arg=<optimized out>)
    at postmaster.c:3076
#7  <signal handler called>
#8  0xfdee195c in _rtld_bind () from /usr/libexec/ld.elf_so
#9  0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

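This appears to be a self-deadlock: _rtld_bind() has already taken the RTLD lock via _rtld_shared_enter(), and __locked_fork(), running in a signal handler that interrupted it, now wants that same lock exclusively via _rtld_exclusive_enter(). The shared holder cannot release the lock until the handler returns, and the handler cannot return until the lock is released.
Although gdb fails to trace any further back, digging in the stack identified the
previous PC as 0x01c2b81c, which is: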

   0x1c2b804 <PostmasterMain+4544>:     lwz     r3,108(r1)
   0x1c2b808 <PostmasterMain+4548>:     mr      r7,r28
   0x1c2b80c <PostmasterMain+4552>:     mr      r4,r23
   0x1c2b810 <PostmasterMain+4556>:     li      r6,0
   0x1c2b814 <PostmasterMain+4560>:     li      r5,0
   0x1c2b818 <PostmasterMain+4564>:     bl      0x1ee6230 <__select50@plt>
-> 0x1c2b81c <PostmasterMain+4568>:     li      r5,0

Evidently, this is the first time this select(2) call has been reached in this process,
so the dynamic linker is still resolving its PLT entry. While that is in progress a SIGCHLD
arrives, and the signal handler tries to fork a new child process.

I realize that calling system functions from signal handlers is generally discouraged;
but POSIX specifies that fork(2) is safe to call from a signal handler, which IMO
makes this a NetBSD bug.

>How-To-Repeat:
I've run into this a few times while running Postgres regression tests, but it's very hard to reproduce that way.  The "startup process" child process has to exit before the parent postmaster process reaches the select(2) in its idle loop for the first time, which would be very unusual timing given the relative amounts of work to be done in each process.  A bespoke test program might be a better way to make it reproducible.
>Fix:
Is there a way to not need the RTLD lock during fork()?

Follow-Ups:
- Re: lib/56979: fork(2) fails to be signal safe
  - From: Joerg Sonnenberger
