NetBSD-Bugs archive


Re: kern/56828: futex calls in Linux emulation sometimes hang



> Date: Sat, 18 Jan 2025 11:36:27 +0100
> From: Thomas Klausner <wiz%NetBSD.org@localhost>
> 
> The futex tests look much better now, but still quite a lot are
> failing (mostly futex_wait issues):
> 
> futex_cmp_requeue01.c:95: TBROK: fork() failed: EAGAIN/EWOULDBLOCK (11)
> tst_test.c:1606: TINFO: Killed the leftover descendant processes

Looks like you hit a process rlimit.  Can you bump ulimit -p or
kern.maxproc?
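
To see which limit fork() is running into, a small helper along these lines could report the per-user process limit from inside a test.  This is only a sketch: get_nproc_limit is a hypothetical name, and it covers only the rlimit side (what ulimit -p adjusts), not the system-wide kern.maxproc.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: fetch the soft and hard per-user process limits.
 * fork() can fail with EAGAIN once the soft limit (rlim_cur) is reached.
 * Returns 0 on success and -1 on error, like getrlimit(2) itself.
 */
static int
get_nproc_limit(struct rlimit *rl)
{
	assert(rl != NULL);
	return getrlimit(RLIMIT_NPROC, rl);
}
```

Printing rl.rlim_cur and rl.rlim_max before the fork loop would show how much headroom the test has.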

> *** futex_wait03 ***
> 
> tst_memutils.c:141: TINFO: oom_score_adj does not exist, skipping the adjustment
> tst_test.c:1558: TINFO: Timeout per run is 0h 00m 30s
> tst_memutils.c:141: TINFO: oom_score_adj does not exist, skipping the adjustment
> futex_wait03.c:63: TINFO: Testing variant: syscall with old kernel spec
> Test timeouted, sending SIGKILL!
> tst_test.c:1612: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
> tst_test.c:1614: TBROK: Test killed! (timeout?)
> 
> Summary:
> passed   0
> failed   0
> broken   1
> skipped  0
> warnings 0

I suspect this is a bug in NetBSD's implementation of /proc/$pid/stat.

This is the only test case that queries it from another thread, I
think, and it looks like when that happens, /proc/$pid/stat doesn't
correctly report the other thread as sleeping (`S') when it is waiting
in futex(FUTEX_WAIT), so the wait-for-sleep busy loop spins forever
(or until timeout).

You could add a printf (followed by an fflush) right after
TST_PROCESS_STATE_WAIT to verify that the test never gets past that
loop.
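
For reference, the state that the wait loop polls for is the third field of the stat line, in the Linux layout "pid (comm) state ...".  A minimal sketch of pulling that field out of a line, assuming that layout, might look like the following; stat_state is a hypothetical helper, not LTP's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: extract the state character from one line of
 * /proc/<pid>/stat text.  The command name may itself contain spaces
 * or parentheses, so look for the last ')' instead of splitting on
 * whitespace.  Returns '\0' if the line does not match the layout.
 */
static char
stat_state(const char *line)
{
	const char *p;

	assert(line != NULL);
	p = strrchr(line, ')');
	if (p == NULL || p[1] != ' ' || p[2] == '\0')
		return '\0';
	return p[2];
}
```

Running something like this against the stat line while the other thread is blocked in futex(FUTEX_WAIT) would show directly whether `S' is ever reported.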

> *** futex_wait05 ***
> [...]
> tst_timer_test.c:263: TINFO: futex_wait() sleeping for 1000us 500 iterations, threshold 450.01us
> tst_timer_test.c:285: TINFO: Found 500 outliners in [20098,13688] range
> tst_timer_test.c:305: TINFO: min 13688us, max 20098us, median 20000us, trunc mean 19976.46us (discarded 25)
> tst_timer_test.c:314: TFAIL: futex_wait() slept for too long

These failures are all about the limited resolution of sleeps.  I'm
guessing you're running at 100 Hz.  These times are around 1-2 ticks
past the requested deadline, or 10-20ms = 10000-20000us (plus a tiny
slop of a few dozen microseconds).  I would expect this to slow things
down but not make them deadlock.
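
The arithmetic can be sketched as follows.  This is a simplified model, not the kernel's actual timeout code: it assumes the request is rounded up to whole ticks and that up to one more tick can pass because the sleep starts partway through the current tick.  The names sleep_bounds and tick_sleep_bounds are made up for illustration.

```c
#include <assert.h>

/* Expected range of the actual sleep duration, in microseconds. */
struct sleep_bounds {
	long min_us;
	long max_us;
};

/*
 * Simplified model of a tick-based sleep: round the request up to
 * whole ticks, then allow one extra tick of slack.
 */
static struct sleep_bounds
tick_sleep_bounds(long request_us, long hz)
{
	struct sleep_bounds b;
	long tick_us, ticks;

	assert(request_us > 0 && hz > 0);
	tick_us = 1000000 / hz;				/* length of one tick */
	ticks = (request_us + tick_us - 1) / tick_us;	/* round up */
	b.min_us = ticks * tick_us;
	b.max_us = (ticks + 1) * tick_us;
	return b;
}
```

With request_us = 1000 and hz = 100 this gives a range of 10000us to 20000us, which lines up with the 13688us minimum and 20000us median in the log above.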

> I tried the Metalworks demo from jdk17 and it worked fine.
> 
> Then I tried the PDF-Over application.
> I could get one successful run through the application, but
> I had about 9 other tries where it didn't complete the process.
> Mostly not show the PDF (step 2 of the process), or show
> just a gray screen.
> 
> Right now top says the process is in futex, so I suspect there are
> still more problems. Perhaps the futex_wait() problem bites us here.

Boo.  I guess we need to kernhist it up to find what futex events had
recently happened before the deadlock.

