NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: kern/57226: NetBSD 10_BETA kernel crash
The following reply was made to PR kern/57226; it has been noted by GNATS.
From: PHO <pho%cielonegro.org@localhost>
To: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/57226: NetBSD 10_BETA kernel crash
Date: Mon, 26 Jun 2023 00:10:04 +0900
On 6/25/23 22:34, Taylor R Campbell wrote:
>
> Unfortunately, I don't think this can be the culprit, because:
>
> 1. Every path to callout_destroy, in itimer_fini, goes through
> itimer_poison.
>
> 2. itimer_poison will, under the itimer lock:
> (a) set it->it_dying,
> (b) callout_halt and wait for any concurrent itimer_callout
>
> 3. itimer_callout will not reschedule itself if it->it_dying is set,
> i.e., if there's any other threads that could potentially be in the
> callout_halt in step 2(b). Nor will any other thread schedule the
> callout if it->it_dying is set. And this decision is made under
> the lock correctly to synchronize with itimer_poison.
But what I found with a bunch of debugging logs is that:
1. cpu0 calls timerfd_create() to initialize an itimer.
2. The same CPU calls timerfd_settime() to schedule the timer.
3. After some ticks itimer_callout() gets invoked. It re-schedules the
callout because it's not dying yet.
4. itimer_callout() releases the itimer lock right before returning.
5. But before itimer_callout() returns and callout_softclock()
re-acquires the callout lock, cpu1 calls timerfd_destroy().
6. cpu1 can acquire the itimer lock (because it's now released), and
calls itimer_poison(). It marks the itimer as dying, but it's too late.
itimer_poison() invokes callout_halt(), which returns without waiting
for the callout to finish because c->c_flags no longer contains
CALLOUT_FIRED. The flag has been cleared on the step 3.
7. cpu1 then calls itimer_fini(), which in turn calls callout_destroy().
cpu0 is still in callout_softclock() and trying to lock the callout,
which makes cpu1 panic.
> It is part of the API contract that the callout can't be concurrently
> rescheduled in order for callout_halt to guarantee that the callout is
> no longer running.
>
> As far as I can tell, kern_time.c satisfies that requirement of the
> contract. If kern_time.c failed to satisfy that part of the contract,
> we would probably see some other kasserts firing about it->it_dying.
>
> So I'm afraid this change will paper over the symptom without fixing
> the underlying problem.
Then kern_time.c indeed fails to satisfy the contract, doesn't it? It's
because the itimer lock is released upon entering and exiting
itimer_callout(). Of course we cannot keep it locked during these
points, so I think the contract is basically unsatisfiable.
Home |
Main Index |
Thread Index |
Old Index