Re: kern/57226: NetBSD 10_BETA kernel crash

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,taca%back-street.net@localhost
Subject: Re: kern/57226: NetBSD 10_BETA kernel crash
From: PHO <pho%cielonegro.org@localhost>
Date: Sun, 25 Jun 2023 15:15:02 +0000 (UTC)

The following reply was made to PR kern/57226; it has been noted by GNATS.

From: PHO <pho%cielonegro.org@localhost>
To: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/57226: NetBSD 10_BETA kernel crash
Date: Mon, 26 Jun 2023 00:10:04 +0900

 On 6/25/23 22:34, Taylor R Campbell wrote:
 > 
 > Unfortunately, I don't think this can be the culprit, because:
 > 
 > 1. Every path to callout_destroy, in itimer_fini, goes through
 >     itimer_poison.
 > 
 > 2. itimer_poison will, under the itimer lock:
 >     (a) set it->it_dying,
 >     (b) callout_halt and wait for any concurrent itimer_callout
 > 
 > 3. itimer_callout will not reschedule itself if it->it_dying is set,
 >     i.e., if there's any other threads that could potentially be in the
 >     callout_halt in step 2(b).  Nor will any other thread schedule the
 >     callout if it->it_dying is set.  And this decision is made under
 >     the lock correctly to synchronize with itimer_poison.

 But what I found with a bunch of debugging logs is that:

 1. cpu0 calls timerfd_create() to initialize an itimer.

 2. The same CPU calls timerfd_settime() to schedule the timer.

 3. After some ticks itimer_callout() gets invoked. It re-schedules the 
 callout because it's not dying yet.

 4. itimer_callout() releases the itimer lock right before returning.

 5. But before itimer_callout() returns and callout_softclock() 
 re-acquires the callout lock, cpu1 calls timerfd_destroy().

 6. cpu1 can acquire the itimer lock (because it's now released), and 
 calls itimer_poison(). It marks the itimer as dying, but it's too late. 
 itimer_poison() invokes callout_halt(), which returns without waiting 
 for the callout to finish because c->c_flags no longer contains 
 CALLOUT_FIRED. The flag has been cleared on the step 3.

 7. cpu1 then calls itimer_fini(), which in turn calls callout_destroy(). 
 cpu0 is still in callout_softclock() and trying to lock the callout, 
 which makes cpu1 panic.

 > It is part of the API contract that the callout can't be concurrently
 > rescheduled in order for callout_halt to guarantee that the callout is
 > no longer running.
 > 
 > As far as I can tell, kern_time.c satisfies that requirement of the
 > contract.  If kern_time.c failed to satisfy that part of the contract,
 > we would probably see some other kasserts firing about it->it_dying.
 > 
 > So I'm afraid this change will paper over the symptom without fixing
 > the underlying problem.

 Then kern_time.c indeed fails to satisfy the contract, doesn't it? It's 
 because the itimer lock is released upon entering and exiting 
 itimer_callout(). Of course we cannot keep it locked during these 
 points, so I think the contract is basically unsatisfiable.

Prev by Date: Re: kern/57226: NetBSD 10_BETA kernel crash
Next by Date: Re: kern/57226: NetBSD 10_BETA kernel crash
Previous by Thread: Re: kern/57226: NetBSD 10_BETA kernel crash
Next by Thread: Re: kern/57226: NetBSD 10_BETA kernel crash
Indexes:

Home | Main Index | Thread Index | Old Index