Current-Users archive


Re: cmake hang solution?



On Tue, Apr 05, 2022 at 02:10:36PM -0000, Michael van Elst wrote:
> wiz%NetBSD.org@localhost (Thomas Klausner) writes:
> >I never saw the cmake hang myself. I still see hangs in guile.
> 
> 
> I see both in almost every pbulk run.


please try this patch for the cmake variation of this hang:

http://www.netbsd.org/~chs/diff.pthread-park-stuck.1

this fixes the problem as seen with taylor's strdup / jemalloc reproducer,
and paulg reports it fixes the hang in building guile too.

what's going on here is that the first time that libpthread calls _lwp_park
when it wants to sleep to wait for a mutex, instead of calling the libc
function directly it first has to call into rtld to resolve the symbol.
the rtld code will call _rtld_shared_enter(), which might also need to sleep
using _lwp_park to wait for the rtld internal lock.  the rtld internal usage
of _lwp_park can accidentally consume an unpark from another thread that was
intended for the libpthread code, and if that happens, then when rtld is done
resolving the symbol and libpthread actually calls the real _lwp_park function,
the unpark has been lost and the libpthread call to _lwp_park will
sleep forever.
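
as a rough illustration of the sequence above, here is a toy single-threaded
model in C.  it is not NetBSD code: the toy_* names are made up for this sketch,
and the only thing it assumes is that each thread has a single pending-unpark
flag which any park call by that thread will consume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* toy model (hypothetical, not kernel code): one pending-unpark flag per thread */
struct toy_lwp {
    bool unpark_pending;
};

/* model of unpark: record that a wakeup is owed to the target thread */
void toy_unpark(struct toy_lwp *t)
{
    assert(t != NULL);
    t->unpark_pending = true;
}

/*
 * model of park: returns true if the call returns immediately because it
 * consumed a pending unpark, false if the thread would go to sleep.
 */
bool toy_park(struct toy_lwp *t)
{
    assert(t != NULL);
    if (t->unpark_pending) {
        t->unpark_pending = false;
        return true;
    }
    return false;
}

/*
 * one mutex wait.  returns true if the libpthread-level park sees its wakeup,
 * false if it would sleep with nobody left to wake it.
 * rtld_parks_first models the lazy-binding path parking on the rtld lock
 * before the real park call is reached.
 */
bool toy_mutex_wait_wakes(bool rtld_parks_first)
{
    struct toy_lwp waiter = { false };

    /* the mutex owner releases the mutex and unparks the waiter */
    toy_unpark(&waiter);

    if (rtld_parks_first) {
        /* nested park inside the resolver swallows the wakeup */
        (void)toy_park(&waiter);
    }

    /* the park that libpthread actually wanted to make */
    return toy_park(&waiter);
}
```

in this model toy_mutex_wait_wakes(false) returns true and
toy_mutex_wait_wakes(true) returns false, the latter being the lost unpark.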

the above patch simply resolves the symbol for the libpthread call to _lwp_park
while the process is still single-threaded, by making one _lwp_park call that
both unparks and parks the calling thread, which just returns immediately.  after that,
the libpthread calls to _lwp_park will no longer call into rtld,
so attempts to unpark libpthread can no longer be lost.
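
the effect of binding early can be sketched with a toy lazily bound call slot.
this is only a model of the idea, not the patch itself and not how rtld is
implemented: park_slot, toy_lazy_stub and the counters are hypothetical names
standing in for the PLT entry, the resolver and the real function.

```c
#include <assert.h>
#include <stddef.h>

/* toy model of a lazily bound function slot; every name here is hypothetical */
typedef int (*park_fn)(void);

int resolver_entries;   /* how many times the model "rtld" resolver ran */
int real_park_calls;    /* how many times the real function ran */

int toy_real_park(void);
int toy_lazy_stub(void);

/* before the first call the slot points at the resolver stub */
park_fn park_slot = toy_lazy_stub;

/* stands in for the real park function */
int toy_real_park(void)
{
    real_park_calls++;
    return 0;
}

/* stands in for the resolver: in the real system the rtld lock is taken here */
int toy_lazy_stub(void)
{
    resolver_entries++;
    park_slot = toy_real_park;  /* bind the slot so later calls skip the resolver */
    return toy_real_park();
}

/* every caller goes through the slot, like a call through the PLT */
int toy_call_park(void)
{
    assert(park_slot != NULL);
    return park_slot();
}

/* the idea of the fix in this model: make one harmless call while single-threaded */
void toy_prebind_park(void)
{
    (void)toy_call_park();
}
```

after toy_prebind_park() has run once, later toy_call_park() calls go straight
to toy_real_park() and resolver_entries stays at 1, so there is no resolver
step left in which a wakeup could be consumed.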

when I wrote that patch I thought it would be a complete fix, but upon reading
the previous email threads about this problem I saw a mention of signals,
and signal handlers can call a wide variety of functions, so this patch
turns out to only fix what is probably the most common way that this problem
manifests.

the nature of _lwp_park/_lwp_unpark (with just a single "already unparked" flag
per thread in the kernel) is such that they cannot safely be used in a nested
fashion like they are in libpthread and rtld, so we need to change one or both
of these callers to use some other primitive to implement sleeping to wait
for a lock, such as futexes, which do not have this kind of per-thread flag
that prevents safe nested usage.
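
to show why a futex-style wait nests safely, here is a toy model in C.  it
assumes only the usual futex wait rule (sleep only if the lock word still holds
the expected value) and is not a proposed implementation; the toy_* names and
the 1 = locked / 0 = free encoding are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * toy futex-style wait check: the caller would sleep only if the word still
 * holds the value it expected.  the wakeup state lives in the word, not in
 * the thread.
 */
bool toy_futex_would_sleep(const int *word, int expected)
{
    assert(word != NULL);
    return *word == expected;
}

/* toy release: mark the lock word free (a real release would also wake waiters) */
void toy_lock_release(int *word)
{
    assert(word != NULL);
    *word = 0;
}

/*
 * nested scenario: a thread is about to wait for a mutex, but first has to
 * wait for a second, inner lock.  returns true if the outer wait still
 * notices that the mutex was released in the meantime.
 */
bool toy_nested_wait_proceeds(void)
{
    int inner_word = 1;   /* 1 = locked, 0 = free (hypothetical encoding) */
    int mutex_word = 1;

    /* the mutex owner releases the mutex before the waiter gets to sleep */
    toy_lock_release(&mutex_word);

    /* the nested wait looks only at inner_word and cannot touch mutex_word */
    bool inner_sleeps = toy_futex_would_sleep(&inner_word, 1);
    (void)inner_sleeps;
    toy_lock_release(&inner_word);

    /* the outer wait sees mutex_word == 0 and does not sleep */
    return !toy_futex_would_sleep(&mutex_word, 1);
}
```

because each lock has its own word, the inner wait has nothing per-thread to
consume, and toy_nested_wait_proceeds() returns true in this model.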

-Chuck

