NetBSD-Users archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: ongoing major problems with NetBSD-5 and LOCKDEBUG on multi-core system (mfi(4) related?)



At Wed, 18 Jan 2012 08:48:30 -0500, Thor Lancelot Simon 
<tls%panix.com@localhost> wrote:
Subject: Re: ongoing major problems with NetBSD-5 and LOCKDEBUG on multi-core 
system (mfi(4) related?)
> 
> I don't think either of these is the right way.
> 
> Unfortunately, I don't know enough about our storage drivers these days
> to recommend one as a good example for you to look at; but I think it
> would be a much better idea to look at another storage driver and see
> how the locking works than to copy OpenBSD or throw KERNEL_LOCK()
> calls around, which is what the changes in the source tree seem to
> have done.

Indeed, though now that I've removed the possibly errant uses of
KERNEL_LOCK() in mfi(4), even if incorrectly, and I'm still getting
LOCKDEBUG panics, I'm beginning to think I should be looking elsewhere
for the cause of the main trouble I'm seeing.


> What is being protected by those calls?  The driver, from another kernel
> subsystem?  Or the driver, from itself?  If the latter, it would be much
> better to think about the driver's actual locking needs and add or adjust
> locks accordingly.

I _think_ it's the former -- the initial change adding KERNEL_LOCK() and
splbio() calls was around code which calls an internal function to grab
sensor data on behalf of sysmon_envsys(9):

----------------------------
revision 1.28
date: 2009/08/26 21:41:05;  author: bouyer;  state: Exp;  lines: +8 -5
mfi.c still uses the spl() synchronisation scheme and so needs the kernel lock.
The sysmon subsystem is marked MPSAFE and so runs without the kernel lock.
So get the kernel lock in mfi_sensor_refresh()  before calling
mfi_ioctl_vol().
This fixes command list corruption seen on heavy I/O load on the mfi
driver(4).
----------------------------


The second change was to wrap up all the calls to functions inside the
main ioctl() routine, because bio(4) can call it too:

----------------------------
revision 1.31
date: 2010/01/19 20:54:32;  author: bouyer;  state: Exp;  lines: +7 -3
branches:  1.31.2;
bio(4) is MP-safe but mfi(4) is not. So get the kernel_lock at
mfi_ioctl() entry and release it on exit.
----------------------------


The problem is all of these ioctl() internal functions basically post a
command to the hardware then call tsleep() waiting for it to complete.

So if I understand correctly these calls from other subsystems are not
in an interrupt context and if I remember correctly tsleep() should be
fine in any normal driver ioctl() function (for a non-MPSAFE driver, if
indeed its ioctl() function was in the device switch table), but clearly
grabbing the KERNEL_LOCK() is the wrong thing to do for this case, at
least without releasing it properly before the tsleep() call.

It seems like bio(4) and sysmon_envsys(9) might be a little aggressive
in their assumptions too -- perhaps they should not assume the device
callbacks they use are in MPSAFE parts of the system.  I think _they_
should know to use a mutex to prevent overlapping calls in these cases.
Either that or they should be removed from drivers which are not MPSAFE.


The splbio()/splx() calls I added from OpenBSD seem quite reasonable and
correct, at least to me :-), but they don't solve the problem with
sysmon_envsys(9) and bio(4) calling into a non-MPSAFE driver.


I don't suppose the OpenBSD driver has been made MPSAFE yet, and in a
way I could copy to NetBSD....  It does have the CCB mutex....


Obviously I'll have to do something about this before I go on:

20:29 [8] $ /usr/sbin/envstat -r
load: 0.00  cmd: envstat 3421 [tstile] 0.00u 0.00s 0% 116k
^?load: 0.00  cmd: envstat 3421 [tstile] 0.00u 0.00s 0% 116k
^Zload: 0.00  cmd: envstat 3421 [tstile] 0.00u 0.00s 0% 116k

(a "kill -9 3421" from root is no help either :-))


-- 
                                                Greg A. Woods
                                                Planix, Inc.

<woods%planix.com@localhost>       +1 250 762-7675        http://www.planix.com/

Attachment: pgpUU5R9zXTlr.pgp
Description: PGP signature



Home | Main Index | Thread Index | Old Index