NetBSD-Bugs archive
Re: kern/56979: fork(2) fails to be signal safe
The following reply was made to PR lib/56979; it has been noted by GNATS.
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: Tom Lane <tgl%sss.pgh.pa.us@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/56979: fork(2) fails to be signal safe
Date: Sun, 16 Oct 2022 10:11:52 +0000
> Date: Sat, 15 Oct 2022 21:17:37 -0400
> From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
>
> Taylor R Campbell <riastradh%NetBSD.org@localhost> writes:
> >> From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
> >> Didn't take long to find out that there's still a problem. With
> >> this patch, it gets past the fork() all right, but there's still
> >> a risk of the child process getting stuck on the RTLD lock later:
>
> > Do I understand correctly that this means you're trying to call dlopen
> > from a signal handler?
>
> Well, it *was* a signal handler, but once it issues fork() I wouldn't
> personally regard it as a signal handler anymore.
It's still in a signal handler because it's interrupting previously
running code somewhere in the middle. You have some control over what
code has been running, but not much:
> The child process
> is certainly never going to return control to the interrupted code.
> The parent process (the Postgres "postmaster") runs with signals blocked
> everywhere except this one select() call in its wait loop, so it's safer
> than it sounds. The postmaster has been coded like that since the
> nineties, and AFAIR this is the first bit of trouble we've had with it.
If the signal is blocked everywhere except during select, that might be
OK except for the trouble you're seeing with rtld lazy binding. But
even with LD_BIND_NOW or whatever, I'm not going to make any promises
about it! E.g., in a threaded program, I wouldn't put it past some
rtld to do RCU-style tidying in a background thread or something.
If you used pselect you could at least ensure that the signal is only
delivered during the system call, not in userland -- and as a bonus
you would also avoid possible deadlock if the signal delivery races
with the non-atomic sigprocmask/select sequence! (I don't know if
it's definitely a problem in postgres, but it is a very easy problem
to have without noticing for a while.) Personally I'd be more
comfortable relying on that, and I suspect it's a much smaller change
than moving logic from signal handlers to the select loop.
> OK, thanks for confirming that. What we've done about this for the
> moment is to force linking with -Wl,-z,now on NetBSD, which fixes
> this particular problem --- at least, we've not seen it since then
> on two different NetBSD test machines that previously did exhibit
> the failure intermittently --- and it seems like generally a good
> idea anyway.
That sounds like it might work for now, although it's still skating on
thin ice. It would be safer to avoid dlopen in a signal handler
altogether, or at least to confine it to a signal delivered during a
system call of your choice, as with pselect.
Given all that, I'm inclined to close this as fixed for the fork part,
and not-a-bug for the dlopen part.