NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/57226: NetBSD 10_BETA kernel crash



The following reply was made to PR kern/57226; it has been noted by GNATS.

From: PHO <pho%cielonegro.org@localhost>
To: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/57226: NetBSD 10_BETA kernel crash
Date: Mon, 26 Jun 2023 00:10:04 +0900

 On 6/25/23 22:34, Taylor R Campbell wrote:
 > 
 > Unfortunately, I don't think this can be the culprit, because:
 > 
 > 1. Every path to callout_destroy, in itimer_fini, goes through
 >     itimer_poison.
 > 
 > 2. itimer_poison will, under the itimer lock:
 >     (a) set it->it_dying,
 >     (b) callout_halt and wait for any concurrent itimer_callout
 > 
 > 3. itimer_callout will not reschedule itself if it->it_dying is set,
 >     i.e., if there's any other threads that could potentially be in the
 >     callout_halt in step 2(b).  Nor will any other thread schedule the
 >     callout if it->it_dying is set.  And this decision is made under
 >     the lock correctly to synchronize with itimer_poison.
 
 But what I found with a bunch of debugging logs is that:
 
 1. cpu0 calls timerfd_create() to initialize an itimer.
 
 2. The same CPU calls timerfd_settime() to schedule the timer.
 
 3. After some ticks itimer_callout() gets invoked. It re-schedules the 
 callout because it's not dying yet.
 
 4. itimer_callout() releases the itimer lock right before returning.
 
 5. But before itimer_callout() returns and callout_softclock() 
 re-acquires the callout lock, cpu1 calls timerfd_destroy().
 
 6. cpu1 can acquire the itimer lock (because it's now released), and 
 calls itimer_poison(). It marks the itimer as dying, but it's too late. 
 itimer_poison() invokes callout_halt(), which returns without waiting 
 for the callout to finish because c->c_flags no longer contains 
 CALLOUT_FIRED. The flag has been cleared on the step 3.
 
 7. cpu1 then calls itimer_fini(), which in turn calls callout_destroy(). 
 cpu0 is still in callout_softclock() and trying to lock the callout, 
 which makes cpu1 panic.
 
 > It is part of the API contract that the callout can't be concurrently
 > rescheduled in order for callout_halt to guarantee that the callout is
 > no longer running.
 > 
 > As far as I can tell, kern_time.c satisfies that requirement of the
 > contract.  If kern_time.c failed to satisfy that part of the contract,
 > we would probably see some other kasserts firing about it->it_dying.
 > 
 > So I'm afraid this change will paper over the symptom without fixing
 > the underlying problem.
 
 Then kern_time.c indeed fails to satisfy the contract, doesn't it? It's 
 because the itimer lock is released upon entering and exiting 
 itimer_callout(). Of course we cannot keep it locked during these 
 points, so I think the contract is basically unsatisfiable.
 



Home | Main Index | Thread Index | Old Index