NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: kern/57226: NetBSD 10_BETA kernel crash
The following reply was made to PR kern/57226; it has been noted by GNATS.
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: PHO <pho%cielonegro.org@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/57226: NetBSD 10_BETA kernel crash
Date: Sun, 25 Jun 2023 13:34:57 +0000
> Date: Sun, 25 Jun 2023 21:38:03 +0900
> From: PHO <pho%cielonegro.org@localhost>
>=20
> I think I found the cause (after a painful debugging session spanning 48=
=20
> hours):
>=20
> The culprit was callout_halt(). "(c->c_flags & CALLOUT_FIRED) !=3D 0" was=
n't
> the correct way to check if a callout is running. It failed to wait for a
> running callout to finish in the following scenario:
>=20
> 1. cpu0 initializes a callout and schedules it.
> 2. cpu0 invokes callout_softlock() and fires the callout, setting the flag
> CALLOUT_FIRED.
> 3. The callout invokes callout_schedule() to re-schedule itself.
> 4. callout_schedule_locked() clears the flag CALLOUT_FIRED, and releases
> the lock.
> 5. Before the lock is re-acquired by callout_softlock(), cpu1 decides to
> destroy the callout. It first invokes callout_halt() to make sure the
> callout finishes running.
> 6. But since CALLOUT_FIRED has been cleared, callout_halt() thinks it's n=
ot
> running and therefore returns without invoking callout_wait().
> 7. cpu1 proceeds to invoke callout_destroy() while it's still running on
> cpu0. callout_destroy() detects that and panics.
Unfortunately, I don't think this can be the culprit, because:
1. Every path to callout_destroy, in itimer_fini, goes through
itimer_poison.
2. itimer_poison will, under the itimer lock:
(a) set it->it_dying,
(b) callout_halt and wait for any concurrent itimer_callout
3. itimer_callout will not reschedule itself if it->it_dying is set,
i.e., if there's any other threads that could potentially be in the
callout_halt in step 2(b). Nor will any other thread schedule the
callout if it->it_dying is set. And this decision is made under
the lock correctly to synchronize with itimer_poison.
It is part of the API contract that the callout can't be concurrently
rescheduled in order for callout_halt to guarantee that the callout is
no longer running.
As far as I can tell, kern_time.c satisfies that requirement of the
contract. If kern_time.c failed to satisfy that part of the contract,
we would probably see some other kasserts firing about it->it_dying.
So I'm afraid this change will paper over the symptom without fixing
the underlying problem.
Home |
Main Index |
Thread Index |
Old Index