NetBSD-Bugs archive


Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)

The following reply was made to PR kern/52126; it has been noted by GNATS.

From: Frank Kardel <>
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 11:05:55 +0100

 On 12/21/19 08:50, Michael van Elst wrote:
 > The following reply was made to PR kern/52126; it has been noted by GNATS.
 > From: (Michael van Elst)
 > To:
 > Cc:
 > Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
 > Date: Sat, 21 Dec 2019 07:47:12 -0000 (UTC)
 > (Frank Kardel) writes:
 >   > Maybe I am missing something, but the prerequisite condition for panic is
 >   > "!start_init_exec". Maybe the test was intended to be inverted so that
 >   > the crude
 >   > deadlock check is enabled while user level is running. Right now it only
 >   > checks during the initialization phase *before*
 >   Excessive waiting for the kernel lock is always a problem, functional
 >   (i.e. delayed interrupt processing and then lost interrupts) or just
 >   performance.
 I am not questioning that. Ideally we can manage to make the kernel lock 
 disappear entirely some day.
 >   But after bootstrap a panic is probably too much, anything
 >   before can be handled by just fixing the code that holds the kernel
 >   lock for too long.
 During bootstrap a panic is even prohibitive. The long hold may not even 
 be caused by wrong code: it can come from debug checks and/or kernel 
 output to a slow console, and on (larger) MP systems even from 
 starvation when multiple threads write to slow consoles. Especially in 
 debug mode the timing is even less predictable (memory size/resource 
 dependent).
 So we are looking here at performance bugs and delays related to 
 starvation and consistency check code.
 My boot panics were caused while chasing the EFI device initialization 
 failures (which are almost gone - need to re-check).
 The kernel panicked all over the place and never reached user mode 
 because of that. The workaround was/is to disable DEBUG and pciverbose, 
 or the spinout code.
 As the spinout code sets an absolute retry barrier, any printf/debug 
 analysis within a kernel_lock section is risky. So while trying to catch 
 'long locks' we may also be tripping over false positives due to slow 
 output/bad design/starvation. I have no idea what the signal/noise ratio 
 is here.
 >   > user-level is running and in some circumstance even prohibits a
 >   > successful boot (slow raster console+debug+verbose boot).
 >   This just shows that slow raster consoles (kernel messages, not tty) have
 >   their limitations. Still a good thing to get this information because
 >   it points to a design problem (kernel lock held too long)
 Starvation may also be a cause of that. We could add a lock counter that 
 resets the spin backoff, to avoid triggering on high contention. But 
 this still looks like a band-aid.
 >   usually somewhere
 >   else. E.g. sdmmc(4) now sleeps in some places instead of spinning in delay()
 >   because of this.
 There are certainly many places where kernel_lock is held needlessly 
 long; these need to be addressed while working on removing the big lock.
 >   There are places in our ata code (and probably mvsata too) that need
 >   similar attention.
 Definitely. This bug looks like some unexpected interaction with the 
 device is going on. Going by the bug report, ipmi printfs are often a 
 victim here as well. It would be nice to get a stack trace, or at least 
 the location where the kernel lock was taken by its owner at that time.
 But I still wonder whether this panic is a good idea, given the 
 signal/noise ratio, especially as it is active only during boot.
