Re: kern/57226: NetBSD 10_BETA kernel crash

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,taca%back-street.net@localhost
Subject: Re: kern/57226: NetBSD 10_BETA kernel crash
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Sun, 25 Jun 2023 13:40:01 +0000 (UTC)

The following reply was made to PR kern/57226; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: PHO <pho%cielonegro.org@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/57226: NetBSD 10_BETA kernel crash
Date: Sun, 25 Jun 2023 13:34:57 +0000

 > Date: Sun, 25 Jun 2023 21:38:03 +0900
 > From: PHO <pho%cielonegro.org@localhost>
 >=20
 > I think I found the cause (after a painful debugging session spanning 48=
 =20
 > hours):
 >=20
 > The culprit was callout_halt(). "(c->c_flags & CALLOUT_FIRED) !=3D 0" was=
 n't
 > the correct way to check if a callout is running. It failed to wait for a
 > running callout to finish in the following scenario:
 >=20
 > 1. cpu0 initializes a callout and schedules it.
 > 2. cpu0 invokes callout_softlock() and fires the callout, setting the flag
 >    CALLOUT_FIRED.
 > 3. The callout invokes callout_schedule() to re-schedule itself.
 > 4. callout_schedule_locked() clears the flag CALLOUT_FIRED, and releases
 >    the lock.
 > 5. Before the lock is re-acquired by callout_softlock(), cpu1 decides to
 >    destroy the callout. It first invokes callout_halt() to make sure the
 >    callout finishes running.
 > 6. But since CALLOUT_FIRED has been cleared, callout_halt() thinks it's n=
 ot
 >    running and therefore returns without invoking callout_wait().
 > 7. cpu1 proceeds to invoke callout_destroy() while it's still running on
 >    cpu0. callout_destroy() detects that and panics.
 
 Unfortunately, I don't think this can be the culprit, because:
 
 1. Every path to callout_destroy, in itimer_fini, goes through
    itimer_poison.
 
 2. itimer_poison will, under the itimer lock:
    (a) set it->it_dying,
    (b) callout_halt and wait for any concurrent itimer_callout
 
 3. itimer_callout will not reschedule itself if it->it_dying is set,
    i.e., if there's any other threads that could potentially be in the
    callout_halt in step 2(b).  Nor will any other thread schedule the
    callout if it->it_dying is set.  And this decision is made under
    the lock correctly to synchronize with itimer_poison.
 
 It is part of the API contract that the callout can't be concurrently
 rescheduled in order for callout_halt to guarantee that the callout is
 no longer running.
 
 As far as I can tell, kern_time.c satisfies that requirement of the
 contract.  If kern_time.c failed to satisfy that part of the contract,
 we would probably see some other kasserts firing about it->it_dying.
 
 So I'm afraid this change will paper over the symptom without fixing
 the underlying problem.

Prev by Date: Re: port-arm/57488: awk(1) sometimes dumps core on SHEEVAPLUG (maybe KIRKWOOD)
Next by Date: Re: kern/57226: NetBSD 10_BETA kernel crash
Previous by Thread: Re: kern/57226: NetBSD 10_BETA kernel crash
Next by Thread: Re: kern/57226: NetBSD 10_BETA kernel crash
Indexes:

Home | Main Index | Thread Index | Old Index