Subject: Re: kern/28594: 2.0_RC5 lock-up/loop in checkaliases()
To: None <andreas@planix.com>
From: Andreas Wrede <andreas@planix.com>
List: netbsd-bugs
Date: 01/02/2005 18:03:24


On 9-Dec-04, at 9:30 AM, andreas@planix.com wrote:

>> Number:         28594
>> Category:       kern
>> Synopsis:       2.0_RC5 lock-up/loop in checkaliases()
>> Confidential:   no
>> Severity:       critical
>> Priority:       high
>> Responsible:    kern-bug-people
>> State:          open
>> Class:          sw-bug
>> Submitter-Id:   net
>> Arrival-Date:   Thu Dec 09 14:30:00 +0000 2004
>> Originator:     Andreas Wrede <andreas@planix.com>
>> Release:        NetBSD 2.0_RC5
>> Organization:
> Planix, Inc.
>> Environment:
> 	
> 	
> System: NetBSD whome.planix.com 2.0_RC5 NetBSD 2.0_RC5 (PLANIX) #11: Tue Nov 23 10:49:49 EST 2004 root@willy.wrede.pvt:/u1/netbsd-2.0/obj/sys/arch/i386/compile.i386/PLANIX i386
> Architecture: i386
> Machine: i386
>> Description:
> 	Every second or third night, during one of the find runs in
> /etc/{daily|security}, the NetBSD 2.0/i386 server locks up. Keystrokes
> no longer echo on the serial console. Entering the debugger usually
> works. When trying to "reboot", I get a "panic: lockmgr: locking
> against myself". At the time of the lock-up, one of the Xserve RAID
> based 1 TB file systems was mounted:
>
> df -h
> Filesystem    Size     Used     Avail Capacity  Mounted on
> /dev/raid1a   1.4G     1.1G      232M    83%    /
> /dev/raid1e   2.0G     1.4G      479M    75%    /var
> /dev/raid1f   3.9G     2.4M      3.7G     0%    /u1
> /dev/sd0a     1.0T     326G      632G    34%    /u5
> kernfs        1.0K     1.0K        0B   100%    /kern
> procfs        4.0K     4.0K        0B   100%    /proc
>
> /u5 is an FFSv1 file system 32 blocks short of the 1 TB mark. The same
> lock-up occurs when the /u5 file system is a 1 TB+ FFSv2.
>
> Stopped in pid 25005.1 (find) at        netbsd:cpu_Debugger+0x4:        leave
> db> bt
> cpu_Debugger(cc0e4b8c,c037ddb0,cc0e4b74,7ff,c1557000) at netbsd:cpu_Debugger+0x4
> comintr(c12b8200,0,cb7d0010,30,cc0e0010) at netbsd:comintr+0x6b9
> Xintr_legacy4() at netbsd:Xintr_legacy4+0xa4
> --- interrupt ---
> checkalias(cfdf8688,120c,c12e6000,cfdfa160,c1908000) at netbsd:checkalias+0x5e
> ufs_vinit(c12e6000,c128c300,c128c200,cc0e4ca8,c23528c0) at netbsd:ufs_vinit+0x69
> ffs_vget(c12e6000,3978196,cc0e4d64,d595eb70,cc0e4cf8) at netbsd:ffs_vget+0x274
> ufs_lookup(cc0e4d94,cfdf8540,cc0e4dac,c037d409,c05730a0) at netbsd:ufs_lookup+0x6d4
> VOP_LOOKUP(cf3ed444,cc0e4e84,cc0e4e98,cc0e4e84,c0573820) at netbsd:VOP_LOOKUP+0x2e
> lookup(cc0e4e74,cbfa6c00,400,cc0e4e8c,cc0e4e24) at netbsd:lookup+0x201
> namei(cc0e4e74,8081448,60,0,8081540) at netbsd:namei+0x138
> sys___lstat13(cd02e2ac,cc0e4f64,cc0e4f5c,0,c153f000) at netbsd:sys___lstat13+0x58
> syscall_plain() at netbsd:syscall_plain+0x7e
> --- syscall (number 280) ---
> 0x480e7357:

I had two more lock-ups in checkalias() after upgrading the kernel to
the 2.0 release.  Looking at the checkalias() routine in
kern/vfs_subr.c in -current, I found this change made by mycroft in
revision 1.231:
date: 2004/08/13 22:48:06;  author: mycroft;  state: Exp;  lines: +59 -54
There is an annoying deadlock that goes like this:
* Process A is closing one file descriptor belonging to a device.  In
  doing so, ffs_update() is called and starts writing a block
  synchronously.  (Note: This leaves the vnode locked.  It also has
  other instances -- stdin, et al -- of the same device open, so
  v_usecount is definitely non-zero.)
* Process B does a revoke() on the device.  The revoke() has to wait
  for the vnode to be unlocked because ffs_update() is still in
  progress.
* Process C tries to open() the device.  It wedges in checkalias()
  repeatedly calling vget() because it returns EBUSY immediately.
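
To make the third step of that log message concrete, here is a small
user-space model of the spin as I understand it.  This is only a sketch
and not the code from kern/vfs_subr.c: struct model_vnode, model_vget()
and model_checkalias_spin() are names made up for illustration, and the
spin budget exists only so that the model terminates.

```c
/* User-space model only; NOT the kernel's checkalias() or vget(). */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; the field names are illustrative. */
struct model_vnode {
	bool being_revoked;	/* Process B's revoke() is still pending */
	int  usecount;		/* references held on the vnode */
};

/* Models a vget() that fails immediately instead of sleeping. */
static int
model_vget(struct model_vnode *vp)
{
	if (vp->being_revoked)
		return EBUSY;
	vp->usecount++;
	return 0;
}

/*
 * Models the retry loop Process C is stuck in.  Returns the number of
 * failed attempts before success, or -1 if the spin budget ran out.
 */
static int
model_checkalias_spin(struct model_vnode *vp, int max_spins)
{
	assert(max_spins > 0);
	for (int spins = 0; spins < max_spins; spins++) {
		if (model_vget(vp) == 0)
			return spins;
		/*
		 * Retry at once.  Nothing in this loop changes
		 * being_revoked, so the state it is waiting on
		 * cannot change from inside the loop.
		 */
	}
	return -1;
}
```

With being_revoked set, model_checkalias_spin() uses up any budget it
is given and returns -1 without taking a reference.  With the flag
clear, it succeeds on the first attempt and returns 0.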

It looks like the deadlock is triggered for me by a find (run from
/etc/daily or /etc/weekly) together with other accesses by the
IMAP/mail server and/or the web server.

Should rev 1.231 be pulled up to the 2.0 branch?  And is the change in
1.231 alone sufficient, or are other prerequisite or corequisite
patches needed?


-- 
	aew
