At Mon, 16 Jan 2012 09:58:01 -0500, Thor Lancelot Simon <tls%panix.com@localhost> wrote:
Subject: Re: ongoing major problems with NetBSD-5 and LOCKDEBUG on multi-core system
>
> On Sun, Jan 15, 2012 at 10:08:42PM -0800, Greg A. Woods wrote:
> > So I was finally able to get a new server, and its a nice big Dell
> > PE2950 with 32GB RAM, lots-o-disk on a PERC-6/i, and a pair of zippy
> > Intel Xeon E5440 CPUs (quad-cores x2).
>
> At a guess, look for a locking bug in the PERC driver.  Aside from
> that, this is a pretty ordinary system and much like those others
> run with LOCKDEBUG all the time.
>
> Usually the SPL NOT LOWERED business means a missing unlock.  It could
> also mean sleeping with a lock held, in such a way that with a
> preemptible kernel you can return to userspace without releasing the
> lock.

The PERC 6/i driver is mfi(4).

A quick peek showed that some time ago revision 1.31 of
sys/dev/ic/mfi.c was pulled up to the netbsd-5 branch.  That change
also seems to be related to the change in 1.28, which has likewise been
pulled up to the netbsd-5 branch.

However, even 1.31 and 1.28 taken together still look somehow wrong at
a quick glance to my locking-naive eye.

The problem seems to be exactly what you've hinted at:  the internal
calls which are now wrapped in KERNEL_LOCK() by the aforementioned
changes subsequently call mfi_mgmt_internal() before calling
KERNEL_UNLOCK_ONE(), and mfi_mgmt_internal() can call tsleep().

I'm not sure what the right solution is, since mfi_mgmt_internal() is
used in a number of other places where KERNEL_LOCK() apparently isn't
needed....

The hackish thought I had was to pass in a flag to say whether or not
the lock is already held.  I'll try that if someone can confirm that
I'm on the right path here, or if anyone has any better ideas, please
do let me know!

-- 
Greg A. Woods
Planix, Inc.

<woods%planix.com@localhost>       +1 250 762-7675        http://www.planix.com/
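
P.S.  To make the flag idea concrete, here is a rough user-space sketch
of the shape I have in mind.  It is NOT the real mfi(4) code:  a
pthread mutex stands in for the kernel lock, fake_tsleep() stands in
for tsleep(), and every name in it (big_lock_enter(), mgmt_internal(),
the lock_held parameter, and so on) is made up for illustration.
Whether dropping the lock around the sleep is actually correct for the
driver is exactly the open question above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the big kernel lock; this is not KERNEL_LOCK() itself. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static bool big_lock_held;	/* tracked by hand; single-threaded demo only */
static int sleeps_done;		/* counts how many times we "slept" */

static void
big_lock_enter(void)
{
	pthread_mutex_lock(&big_lock);
	big_lock_held = true;
}

static void
big_lock_exit(void)
{
	big_lock_held = false;
	pthread_mutex_unlock(&big_lock);
}

/* Stand-in for tsleep(): complains if we try to sleep with the lock held. */
static void
fake_tsleep(void)
{
	assert(!big_lock_held);
	sleeps_done++;
}

/*
 * Hypothetical flag-taking variant of a function like mfi_mgmt_internal().
 * When the caller says it holds the lock, drop it around the sleep and
 * take it again before returning, so the caller's own unlock still pairs up.
 */
static int
mgmt_internal(bool lock_held)
{
	if (lock_held)
		big_lock_exit();
	fake_tsleep();
	if (lock_held)
		big_lock_enter();
	return 0;
}

/* A call site that wraps the call in the lock, like the pulled-up changes. */
static int
locked_caller(void)
{
	int error;

	big_lock_enter();
	error = mgmt_internal(true);
	big_lock_exit();
	return error;
}

/* A call site that never takes the lock, like the other existing users. */
static int
unlocked_caller(void)
{
	return mgmt_internal(false);
}
```

Calling locked_caller() and unlocked_caller() both get through
fake_tsleep() without tripping the assertion, while calling
mgmt_internal(false) with the lock held would trip it.  The obvious
downside is that every call site has to get the flag right by hand,
which is why I call it hackish.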