Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
The following reply was made to PR kern/52126; it has been noted by GNATS.
From: Frank Kardel <kardel%netbsd.org@localhost>
Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
Date: Sat, 21 Dec 2019 11:05:55 +0100
On 12/21/19 08:50, Michael van Elst wrote:
> The following reply was made to PR kern/52126; it has been noted by GNATS.
> From: mlelstv%serpens.de@localhost (Michael van Elst)
> To: gnats-bugs%netbsd.org@localhost
> Subject: Re: kern/52126 (mvsata Marvell 88SX6081 panics on boot)
> Date: Sat, 21 Dec 2019 07:47:12 -0000 (UTC)
> kardel%netbsd.org@localhost (Frank Kardel) writes:
 > > Maybe I am missing something, but the prerequisite condition for the
 > > panic is "!start_init_exec". Maybe the test was intended to be
 > > inverted so that the crude deadlock check is enabled while user level
 > > is running. Right now it only checks during the initialization phase
 > > *before*
> Excessive waiting for the kernel lock is always a problem, functional
> (i.e. delayed interrupt processing and then lost interrupts) or just
I am not questioning that. Ideally we can manage to make the kernel lock
disappear some day.
> But after bootstrap a panic is probably too much, anything
> before can be handled by just fixing the code that holds the kernel
> lock for too long.
During bootstrap a panic is even prohibitive. The long hold may not even
be caused by wrong code but by debug checks and/or kernel output to a
slow console, and on (larger) MP systems even by starvation when multiple
threads write to slow consoles. Especially in debug mode the timing is
even less predictable (memory size/resource dependent).
So what we are looking at here are performance bugs and delays related to
starvation and consistency-check code.
My boot panics were caused while chasing the EFI device initialization
failures (which are almost gone - need to re-check). The kernel panicked
all over the place and never reached user mode because of that. The
workaround was/is to disable DEBUG and pciverbose or the spinout code.
As the spinout code sets an absolute retry barrier, any printf/debug
analysis within a kernel_lock section is risky. So while trying to catch
'long locks' we may also be tripping over false positives due to slow
output/bad design/starvation. I have no idea what the signal/noise ratio
of this check will be.
> > user-level is running and in some circumstance even prohibits a
> > successful boot (slow raster console+debug+verbose boot).
> This just shows that slow raster consoles (kernel messages, not tty) have
> their limitations. Still a good thing to get this information because
> it points to a design problem (kernel lock held too long)
Starvation may also be a cause of that. We could add a lock counter that
resets the spin backoff, to avoid triggering on high contention. But this
still looks like a band-aid.
> usually somewhere
> else. E.g. sdmmc(4) now sleeps in some places instead of spinning in delay()
> because of this.
There are certainly many places where kernel_lock is held needlessly
long. These need to be addressed while working on removing the big lock.
> There are places in our ata code (and probably mvsata too) that need
> similar attention.
Definitely. This bug looks like there is some unexpected interaction
going on with the device. Also, going by the bug report, ipmi printfs are
often a victim here. It would be nice to get a stack trace, or at least
the location where the lock was taken, for the kernel lock owner at that
time. But I still wonder whether this panic is a good idea given the
signal/noise ratio, especially as it is active only during boot.