Re: kern/56979: fork(2) fails to be signal safe

To: lib-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,tgl%sss.pgh.pa.us@localhost
Subject: Re: kern/56979: fork(2) fails to be signal safe
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 16 Oct 2022 10:15:02 +0000 (UTC)

The following reply was made to PR lib/56979; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: Tom Lane <tgl%sss.pgh.pa.us@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/56979: fork(2) fails to be signal safe
Date: Sun, 16 Oct 2022 10:11:52 +0000

 > Date: Sat, 15 Oct 2022 21:17:37 -0400
 > From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
 > 
 > Taylor R Campbell <riastradh%NetBSD.org@localhost> writes:
 > >> From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
 > >> Didn't take long to find out that there's still a problem.  With
 > >> this patch, it gets past the fork() all right, but there's still
 > >> a risk of the child process getting stuck on the RTLD lock later:
 > 
 > > Do I understand correctly that this means you're trying to call dlopen
 > > from a signal handler?
 > 
 > Well, it *was* a signal handler, but once it issues fork() I wouldn't
 > personally regard it as a signal handler anymore.

 It's still in a signal handler because it's interrupting previously
 running code somewhere in the middle.  You have some control over what
 code has been running, but not much:

 >                                                    The child process
 > is certainly never going to return control to the interrupted code.
 > The parent process (the Postgres "postmaster") runs with signals blocked
 > everywhere except this one select() call in its wait loop, so it's safer
 > than it sounds.  The postmaster has been coded like that since the
 > nineties, and AFAIR this is the first bit of trouble we've had with it.

 If the signal is blocked except during select, that might be
 OK except for the trouble you're seeing with rtld lazy binding.  But
 even with LD_BIND_NOW or whatever, I'm not going to make any promises
 about it!  E.g., in a threaded program, I wouldn't put it past some
 rtld to do RCU-style tidying in a background thread or something.

 If you used pselect you could at least ensure that the signal is only
 delivered during the system call, not in userland -- and as a bonus
 you would also avoid possible deadlock if the signal delivery races
 with the non-atomic sigprocmask/select sequence!  (I don't know if
 it's definitely a problem in postgres, but it is a very easy problem
 to have without noticing for a while.)  Personally I'd be more
 comfortable relying on that, and I suspect it's a much smaller change
 than moving logic from signal handlers to the select loop.
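
 The pselect approach described above can be sketched roughly as
 follows.  This is only a minimal illustration, not Postgres code: the
 names got_signal, on_signal, setup_signal_window, and wait_once are
 hypothetical, and the handler just sets a flag.  The signal stays
 blocked in normal execution and is unblocked only by the mask passed
 to pselect, so delivery can happen only while the process sits in
 that system call.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

/* Hypothetical flag set by the handler; for illustration only. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/*
 * Install the handler and block signo for normal execution.
 * On success, *wait_mask holds the mask to hand to pselect():
 * the previous signal mask with signo removed.
 */
static int setup_signal_window(int signo, sigset_t *wait_mask)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) == -1)
        return -1;
    sigdelset(wait_mask, signo);
    return 0;
}

/*
 * Wait up to timeout_sec seconds.  pselect() swaps in wait_mask for
 * the duration of the call and restores the old mask on return, so
 * there is no window between unblocking and waiting.
 * Returns 1 if the handler ran, 0 on timeout, -1 on any other error.
 */
static int wait_once(const sigset_t *wait_mask, time_t timeout_sec)
{
    struct timespec ts = { .tv_sec = timeout_sec, .tv_nsec = 0 };
    int r = pselect(0, NULL, NULL, NULL, &ts, wait_mask);

    if (r == -1 && errno == EINTR && got_signal) {
        got_signal = 0;
        return 1;
    }
    return r == 0 ? 0 : -1;
}
```

 A caller would run setup_signal_window(SIGUSR1, &mask) once at
 startup and then call wait_once(&mask, ...) in its loop; a real
 server would also pass its file descriptor sets to pselect instead
 of the NULL placeholders used here.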

 > OK, thanks for confirming that.  What we've done about this for the
 > moment is to force linking with -Wl,-z,now on NetBSD, which fixes
 > this particular problem --- at least, we've not seen it since then
 > on two different NetBSD test machines that previously did exhibit
 > the failure intermittently --- and it seems like generally a good
 > idea anyway.

 That sounds like it might work for now, although it's still skating on
 thin ice and it would be safer to avoid dlopen in a signal handler
 altogether or at least confine it to a signal delivered during a
 system call of your choice as with pselect.

 Given all that, I'm inclined to close this as fixed for the fork part,
 and not-a-bug for the dlopen part.
