Subject: kern/11906: panics related to softdep/nfsd
To: None <gnats-bugs@gnats.netbsd.org>
From: None <p@ppires.org>
List: netbsd-bugs
Date: 01/06/2001 14:35:15
>Number:         11906
>Category:       kern
>Synopsis:       System panics in softdep code, triggered by nfsd operations.
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 06 14:35:00 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Paulo Alexandre Pinto Pires
>Release:        NetBSD-current 2001/01/06 (1.5Q)
>Organization:
>Environment:
	
System: NetBSD domine.ppires.org 1.5Q NetBSD 1.5Q (DOMINE-20010106) #0: Sat Jan 6 02:04:52 BRST 2001 pappires@mateus.ppires.org:/usr/src/sys/arch/i386/compile/DOMINE-20010106 i386
Architecture: i386
Machine: i386, Pentium 133MHz, 40MB RAM
>Description:

The system panics under particular conditions, but I cannot pin down
exactly what those conditions are.  Most often, panics occur
while an NFS client is running Netscape Messenger, as it parses,
reindexes or compacts message folders located on the NFS server.  Such
operations do not look troublesome to me at first glance, but they
seem to cause the NFS server to break.

The problem was first observed in 1.5P, as of 2000/12/24.  When it was
triggered, I got panic messages like the one shown below:

	panic: softdep_write_inodeblock: direct pointer #xx mismatch yy != zz
	Stopped in pid nnn (nfsd) at    cpu_Debugger+0x4:       leave

Even though I upgraded the NFS server machine to 1.5Q, as of 2001/01/06,
it still panics, as mentioned above.  The only difference from the
behaviour of 1.5P is that I got an extra message a few minutes
before the panic:

	ffs_fsync: dirty: tag 1 type VREG, usecount 1, writecount 0, refcount 1,
		tag VT_UFS, ino 892802, on dev 0, 8 flags 0x0, effnlink 1, nlink 1
		mode 0100644, owner 1001, group 256, size 196608 lock type vnlock: EXCL (count 1) by pid 132

At that moment, I was running Netscape Messenger on a 1.5P (2000/12/24)
NFS client machine.  Then, after a few (two or three) folder switches,
I eventually got the panic message:

	panic: softdep_write_inodeblock: direct pointer #7 mismatch 0 != 3569544
	Stopped in pid 132 (nfsd) at    cpu_Debugger+0x4:       leave
	db> continue
	syncing disks... panic: lockmgr: locking against myself
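
When several of these panics need to be compared, the message can be
split into fields mechanically.  The following is a minimal Python
sketch, not part of the original report.  The pattern follows the format
string that gdb prints from the core file ("panic: %s: direct pointer
#%d mismatch %d != %d").  The function name parse_softdep_panic and the
field names "found" and "expected" are hypothetical; in particular,
reading the first number as the value found in the inode and the second
as the value softdep expected is an assumption.

```python
import re

# Mirrors the kernel format string
# "panic: %s: direct pointer #%d mismatch %d != %d".
PANIC_RE = re.compile(
    r"panic: (?P<func>\w+): direct pointer #(?P<index>\d+) "
    r"mismatch (?P<found>-?\d+) != (?P<expected>-?\d+)"
)


def parse_softdep_panic(line):
    """Return the fields of a softdep direct-pointer panic, or None."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return {
        "func": m.group("func"),
        "index": int(m.group("index")),
        # Assumption: left-hand value is what was found in the inode,
        # right-hand value is what softdep expected to see there.
        "found": int(m.group("found")),
        "expected": int(m.group("expected")),
    }


if __name__ == "__main__":
    msg = ("panic: softdep_write_inodeblock: "
           "direct pointer #7 mismatch 0 != 3569544")
    print(parse_softdep_panic(msg))
```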

After the reboot, I got this from gdb:

	pappires@domine:/var/crash [9]: gdb netbsd.4
	GNU gdb 4.17
	Copyright 1998 Free Software Foundation, Inc.
	GDB is free software, covered by the GNU General Public License, and you are
	welcome to change it and/or distribute copies of it under certain conditions.
	Type "show copying" to see the conditions.
	There is absolutely no warranty for GDB.  Type "show warranty" for details.
	This GDB was configured as "i386--netbsd"...(no debugging symbols found)...
	(gdb) target kcore netbsd.4.core
	panic: %s: direct pointer #%d mismatch %d != %d
	#0  0x104 in ?? ()
	(gdb) backtrace
	#0  0x104 in ?? ()
	#1  0xc01f149f in cpu_reboot ()
	#2  0xc0133671 in panic ()
	#3  0xc0124e3e in lockmgr ()
	#4  0xc01544ec in genfs_lock ()
	#5  0xc01521ef in VOP_LOCK ()
	#6  0xc0151a46 in vn_lock ()
	#7  0xc014bcb2 in vget ()
	#8  0xc01d3d10 in ffs_sync ()
	#9  0xc014df86 in sys_sync ()
	#10 0xc014cfd8 in vfs_shutdown ()
	#11 0xc01f1477 in cpu_reboot ()
	#12 0xc0133671 in panic ()
	#13 0xc01cf9db in initiate_write_inodeblock ()
	#14 0xc01cf677 in softdep_disk_io_initiation ()
	#15 0xc015a3ae in spec_strategy ()
	#16 0xc01526ec in VOP_STRATEGY ()
	#17 0xc0146c0c in bwrite ()
	#18 0xc01cb55c in ffs_update ()
	#19 0xc015255c in VOP_UPDATE ()
	#20 0xc01cbc86 in ffs_truncate ()
	#21 0xc015251a in VOP_TRUNCATE ()
	#22 0xc01d9221 in ufs_setattr ()
	#23 0xc0151cb0 in VOP_SETATTR ()
	#24 0xc0196e65 in nfsrv_setattr ()
	#25 0xc01b21df in nfssvc_nfsd ()
	#26 0xc01b195f in sys_nfssvc ()
	#27 0xc01f57dd in syscall_plain ()
	#28 0xc0100d66 in syscall1 ()
	can not access 0xbfbfdce8, invalid translation (invalid PDE)
	can not access 0xbfbfdce8, invalid translation (invalid PDE)
	Cannot access memory at address 0xbfbfdce8.
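
The backtrace contains two panic() frames: #12, called from
initiate_write_inodeblock(), and #2, called from lockmgr() while the
first panic was syncing disks.  The following minimal Python sketch (not
part of the original report) lists the caller of each panic() frame in a
gdb backtrace, innermost first, so that nested panics like this one are
easy to spot.  The function name panic_callers is hypothetical, and the
frame pattern assumes the "#N  0xADDR in name ()" layout shown above.

```python
import re

# Matches lines such as "#2  0xc0133671 in panic ()"; the address part
# is optional so that frames printed without one still match.
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(\)")


def panic_callers(backtrace):
    """Return the function that called each panic() frame, innermost first."""
    frames = [(int(num), name) for num, name in FRAME_RE.findall(backtrace)]
    callers = []
    for i, (_num, name) in enumerate(frames):
        # gdb lists the innermost frame first, so the caller of a frame
        # is the next entry in the list.
        if name == "panic" and i + 1 < len(frames):
            callers.append(frames[i + 1][1])
    return callers
```

Applied to the backtrace above, the list would start with the later
panic (lockmgr) and end with the original one
(initiate_write_inodeblock).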

At the time of the panic, the following processes were running:

	PID TT  STAT    TIME COMMAND
	  0 ??  DKs  0:00.00 (swapper)
	  1 ??  TWs  0:00.00 init 
	  2 ??  DK   0:00.00 (usb0)
	  3 ??  DK   0:00.00 (apm0)
	  4 ??  DK   0:00.00 (pagedaemon)
	  5 ??  DK   0:00.00 (reaper)
	  6 ??  DK   0:00.00 (ioflush)
	  7 ??  DK   0:00.00 (aiodoned)
	 73 ??  Ts   0:00.00 /usr/sbin/syslogd -s 
	 83 ??  Ts   0:00.00 /usr/sbin/named 
	 87 ??  TWs  0:00.00 /usr/sbin/rpcbind -l 
	 91 ??  Ts   0:00.00 /usr/sbin/ypserv -d 
	 95 ??  Ts   0:00.00 /usr/sbin/ypbind 
	 99 ??  TWs  0:00.00 /usr/sbin/rpc.yppasswdd 
	120 ??  TWs  0:00.00 (mountd)
	129 ??  TWs  0:00.00 (nfsd)
	132 ??  RL   0:02.00 nfsd: server 
	133 ??  TL   0:00.00 nfsd: server 
	134 ??  TL   0:00.00 nfsd: server 
	135 ??  TL   0:00.00 (nfsd)
	136 ??  TWs  0:00.00 nfsd: server (rpc.statd)
	138 ??  TWs  0:00.00 (rpc.lockd)
	150 ??  TWs  0:00.00 (amd)
	154 ??  IK   0:00.00 (nfsio)
	155 ??  IK   0:00.00 (nfsio)
	156 ??  IK   0:00.00 (nfsio)
	157 ??  IK   0:00.00 (nfsio)
	163 ??  TWs  0:00.00 (timed)
	170 ??  TWs  0:00.00 (rwhod)
	190 ??  TWs  0:00.00 /usr/sbin/lpd 
	195 ??  Ts   0:00.00 /usr/local/sbin/httpd 
	200 ??  TW   0:00.00 /usr/local/sbin/httpd 
	201 ??  TW   0:00.00 (httpd)
	202 ??  TWs  0:00.00 /usr/pkg/sbin/smbd -D 
	204 ??  Ts   0:00.00 /usr/pkg/sbin/nmbd -D 
	207 ??  TW   0:00.00 /usr/pkg/sbin/nmbd -D 
	214 ??  TWs  0:00.00 /usr/local/sbin/sshd (sshd1)
	221 ??  TWs  0:00.00 (unlinkd) (unlinkd)
	222 ??  TWs  0:00.00 /usr/sbin/dhcpd -q 
	227 ??  TWs  0:00.00 (apmd)
	236 ??  TWs  0:00.00 sendmail: accepting connections 
	241 ??  TWs  0:00.00 /usr/sbin/inetd -l 
	244 ??  TWs  0:00.00 /usr/sbin/cron 
	249 ??  TWs  0:00.00 (dnsserver) (dnsserver)
	250 ??  TWs  0:00.00 (dnsserver) (dnsserver)
	251 ??  TWs  0:00.00 (dnsserver) (dnsserver)
	252 ??  Ts   0:00.00 (pinger) (pinger)
	253 ??  T    0:00.00 (sshd1)
	265 ??  TWs  0:00.00 (pppd)
	266 p1  TWs  0:00.00 (tcsh)
	444 p1  T+   0:00.00 (ftp)
	 79 E0- T    0:00.00 /usr/sbin/ipmon -sn 
	208 E0- TW   0:00.00 /bin/sh /usr/local/sbin/RunCache 
	213 E0- T    0:00.00 squid -NY 
	246 E0  TWs+ 0:00.00 (getty)
	247 E1  TWs+ 0:00.00 /usr/libexec/getty Pc ttyE1 
	248 E2  TWs+ 0:00.00 /usr/libexec/getty Pc ttyE2 

>How-To-Repeat:
	Set up an NFS server and an NFS client.  Have the NFS server
	mount its file system with softdeps enabled and start poking
	at Netscape message folders over NFS.  Even small folders
	(ten messages or fewer!) will do.

	Even though the problem occurs in an NFS environment, it is
	possible that a single computer, with local file system
	operations only, may experience the same problem.  I believe
	so because some messages on the current-users mailing list
	mentioned similar symptoms without NFS being involved.
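
For anyone without Netscape Messenger at hand, the following Python
sketch may approximate the workload from the client side.  It is an
assumption, not a confirmed reproduction: it repeatedly grows a file and
then shrinks it, on the theory that folder compaction does something
similar, since the backtrace shows the panic reached through
nfsrv_setattr() and ffs_truncate().  The function name churn_folder, the
sizes, and the round count are all hypothetical, and the path is meant
to point at a file on the NFS mount.

```python
import os


def churn_folder(path, rounds=50, chunk=8192):
    """Grow a file, then truncate it, 'rounds' times; return final size."""
    payload = b"x" * chunk
    for _ in range(rounds):
        # Append several chunks, as if new messages were written.
        with open(path, "ab") as f:
            f.write(payload * 4)
            f.flush()
            os.fsync(f.fileno())
        # Shrink the file; over NFS a size change is sent as SETATTR.
        with open(path, "r+b") as f:
            f.truncate(chunk)
            f.flush()
            os.fsync(f.fileno())
    return os.path.getsize(path)
```

Whether this actually triggers the panic has not been verified; it is
only meant as a starting point for a smaller test case.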

>Fix:
	
>Release-Note:
>Audit-Trail:
>Unformatted: