Subject: kern/9153: LFS deadlock during stress-testing
To: None <gnats-bugs@gnats.netbsd.org>
From: Havard Eidnes <he@runit.no>
List: netbsd-bugs
Date: 01/09/2000 09:48:24
>Number: 9153
>Category: kern
>Synopsis: LFS deadlock during stress-testing
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people (Kernel Bug People)
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sun Jan 9 09:45:00 2000
>Last-Modified:
>Originator: Havard Eidnes
>Organization:
RUNIT AS
>Release: NetBSD-1.4.2_ALPHA
>Environment:
NetBSD albatross.runit.sintef.no 1.4.2_ALPHA NetBSD 1.4.2_ALPHA (ALBATROSS) #18: Wed Jan 5 18:49:22 MET 2000 he@albatross.runit.sintef.no:/usr/src/sys/arch/alpha/compile/ALBATROSS alpha
System:
As above.
>Description:
My test LFS file system has managed to deadlock itself again,
with multiple processes waiting in "lfsino" (including the
cleaner) and with update wedged on another channel. This
apparently happened while one of my "bonnie -s 2000" runs was
in the "seek" phase, /etc/daily's find job was running, and
lots of files on the LFS were also being removed.
The process list, captured while the system was still running:
albatross% ps axl
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
0 0 0 0 -18 0 0 31000 schedu DLs ?? 0:00.26 (swapper)
0 1 0 0 10 0 656 320 wait Is ?? 0:00.02 init
0 2 0 0 -18 0 0 31000 daemon DL ?? 0:00.00 (pagedaemon)
0 3 0 0 -18 0 0 31000 reaper DL ?? 0:00.03 (reaper)
0 114 1 0 2 0 176 672 select Ss ?? 1:32.43 syslogd
0 122 1 0 2 0 104 728 select Is ?? 0:25.51 portmap -l
0 126 1 0 10 0 100704 704 mfsidl Is ?? 0:00.09 mount_mfs -s 200000 -o nosuid -o nodev /de
0 154 1 0 -5 0 32 496 lfs_di Ds ?? 11:00.62 update 30
0 156 1 0 10 0 392 832 nanosl Is ?? 0:01.64 cron
0 158 1 0 18 -12 784 1168 pause S<s ?? 0:12.71 xntpd -p /var/run/xntpd.pid
0 163 1 0 2 0 176 848 select Is ?? 0:18.62 inetd -l
13841 175 1 27 95 20 14912 15656 - RN ?? 3496:28.93 ./setiathome
0 184 1 0 2 0 640 1040 select Is ?? 0:03.03 /usr/pkg/sbin/sshd (sshd1)
0 187 184 0 2 0 848 1352 select I ?? 0:01.13 /usr/pkg/sbin/sshd (sshd1)
0 192 184 0 2 0 848 1360 select I ?? 0:00.44 /usr/pkg/sbin/sshd (sshd1)
0 199 184 0 2 0 848 1360 select S ?? 0:23.56 /usr/pkg/sbin/sshd (sshd1)
0 277 1 0 10 0 88 560 wait Is ?? 0:00.00 /usr/libexec/lfs_cleanerd /spare1
0 278 277 30 -14 0 131120 69184 lfsino D ?? 231:15.26 /usr/libexec/lfs_cleanerd /spare1
0 1335 184 0 2 0 848 1360 select S ?? 0:22.13 /usr/pkg/sbin/sshd (sshd1)
0 7998 184 0 2 0 880 1392 select S ?? 0:02.97 /usr/pkg/sbin/sshd (sshd1)
0 25337 184 0 2 0 880 1416 select I ?? 0:00.17 /usr/pkg/sbin/sshd (sshd1)
0 26546 156 0 2 0 392 912 netio I ?? 0:00.00 CRON (cron)
0 26547 26546 0 10 0 1032 480 wait Is ?? 0:00.01 /bin/sh -c /bin/sh /etc/daily 2>&1 | tee /
0 26548 26547 0 10 0 1072 520 wait I ?? 0:00.02 /bin/sh /etc/daily
0 26549 26547 0 2 0 56 584 netio I ?? 0:00.01 tee /var/log/daily.out
0 26551 26547 0 2 0 1168 1528 netio I ?? 0:00.07 sendmail -t
0 26558 26548 6 -14 0 1328 1896 lfsino D ?? 0:16.90 find / ( ! -fstype local -o -fstype rdonly
13841 189 187 6 18 0 1040 1440 pause Is p0 0:00.07 -tcsh
0 205 189 0 18 0 1008 536 pause I p0 0:00.10 -csh
0 407 205 0 3 0 176 768 ttyin I+ p0 0:00.02 script
0 408 407 0 3 0 192 712 ttyin S+ p0 0:01.01 script
13841 195 192 0 18 0 1056 1472 pause Is p1 0:00.05 -tcsh
0 22721 195 17 18 0 960 520 pause I p1 0:00.07 -csh
0 22740 22721 8 10 0 1248 624 wait I p1 0:00.54 sh
0 25897 22740 7 35 0 616 224 cleane I+ p1 0:01.06 rm -rf pkgsrc usr xsrc
13841 201 199 0 18 0 1056 1472 pause Is p2 0:00.04 -tcsh
13841 217 201 0 3 0 672 1344 ttyin S+ p2 2:28.89 systat vm
0 409 408 0 18 0 960 520 pause Is p3 0:00.03 -bin/csh -i
0 413 409 0 10 0 1096 424 wait I p3 0:00.77 sh
0 25674 413 3 2 0 80 688 netio I+ p3 5:46.87 bonnie -s 2000
0 25925 25674 3 31 0 80 592 cleane I+ p3 0:00.01 bonnie -s 2000
0 25926 25674 3 -14 0 80 592 lfsino D+ p3 0:00.00 bonnie -s 2000
0 25927 25674 3 -14 0 80 592 lfsino D+ p3 0:00.00 bonnie -s 2000
13841 1337 1335 0 18 0 1072 1488 pause Is p4 0:00.09 -tcsh
13841 1348 1337 0 3 0 672 1368 ttyin S+ p4 2:22.99 systat vm
13841 585 8000 0 28 0 816 368 - R+ p5 0:00.00 ps -axl
13841 8000 7998 0 18 0 1064 1504 pause Ss p5 0:00.12 -tcsh
13841 25339 25337 0 18 0 1064 1528 pause Is p6 0:00.10 -tcsh
13841 25348 25339 0 2 0 80 648 select S+ p6 0:01.63 tail -f update.12
13841 25350 25339 0 2 0 496 896 netio I+ p6 0:00.07 egrep -v ^cvs server: Updat
0 186 1 18 3 0 112 784 ttyin Is+ C0 0:00.01 /usr/libexec/getty std.9600 console
albatross%
The default resource limits are:
albatross% limit
cputime unlimited
filesize unlimited
datasize 131072 kbytes
stacksize 2048 kbytes
coredumpsize unlimited
memoryuse 224968 kbytes
descriptors 64
memorylocked 74989 kbytes
maxproc 80
albatross%
I captured a crash dump, and dug out the following details:
The process traces are of
o the active lfs_cleanerd
o update
o find
o one of the bonnie processes (pid 25927)
albatross# gdb -q netbsd.12.gdb
(gdb) target kcore netbsd.12.core
#0 0xfffffc0000477d78 in dumpsys ()
at ../../../../arch/alpha/alpha/machdep.c:1444
1444 savectx(&dumppcb);
(gdb) source /sys/gdbscripts/xps
(gdb) xps
proc pid flag st wchan comm
0xfffffc000e66c5d0 26558 4004 3 0xfffffc000ddefe70 find (lfsinode)
0xfffffc000e66cd70 26551 4184 3 0xfffffc000d6562c0 sendmail (netio)
0xfffffc0009b901f8 26549 4084 3 0xfffffc0006d8d158 tee (netio)
0xfffffc0009b905c8 26548 4084 3 0xfffffc0009b905c8 sh (wait)
0xfffffc0009b91320 26547 4084 3 0xfffffc0009b91320 sh (wait)
0xfffffc0009b91ac0 26546 184 3 0xfffffc000d656500 cron (netio)
0xfffffc0009b903e0 25927 6 3 0xfffffc000ddefe70 bonnie (lfsinode)
0xfffffc0009b91508 25926 6 3 0xfffffc000ddefe70 bonnie (lfsinode)
0xfffffc0009b91138 25925 86 3 0xfffffe000011502c bonnie (cleaner)
0xfffffc0009b90f50 25897 4086 3 0xfffffe000011502c rm (cleaner)
0xfffffc0005cd7318 25674 4086 3 0xfffffc0006d8d4b8 bonnie (netio)
0xfffffc0009b90998 25339 4086 3 0xfffffc000235aad0 tcsh (ttyin)
0xfffffc0009b91ca8 25337 84 3 0xfffffc0000510528 sshd1 (select)
0xfffffc0009b918d8 22740 4086 3 0xfffffc0009b918d8 sh (wait)
0xfffffc0009b90d68 22721 4086 3 0xfffffe001078a1a0 csh (pause)
0xfffffc0009b90b80 8000 4086 3 0xfffffc000235a978 tcsh (ttyin)
0xfffffc0005cd78d0 7998 84 3 0xfffffc0000510528 sshd1 (select)
0xfffffc0005cd7ca0 1348 4186 3 0xfffffc000235a820 systat (ttyin)
0xfffffc0005cd7ab8 1337 4086 3 0xfffffe001077a1a0 tcsh (pause)
0xfffffc0005cd7500 1335 84 3 0xfffffc0000510528 sshd1 (select)
0xfffffc0005cd76e8 413 4086 3 0xfffffc0005cd76e8 sh (wait)
0xfffffc0005cd6f48 409 4086 3 0xfffffe001076a1a0 csh (pause)
0xfffffc0005cd6990 408 86 3 0xfffffc000235a740 script (ttyin)
0xfffffc0005cd7130 407 4086 3 0xfffffc000235a2c0 script (ttyin)
0xfffffc0005cd6d60 278 4 3 0xfffffc000b706180 lfs_cleanerd (lfsinode)
0xfffffc0005cd6b78 277 84 3 0xfffffc0005cd6b78 lfs_cleanerd (wait)
0xfffffc0005cd67a8 217 4186 3 0xfffffc000235a570 systat (ttyin)
0xfffffc0005cd65c0 205 4086 3 0xfffffe00107521a0 csh (pause)
0xfffffc0005cd63d8 201 4086 3 0xfffffe001074e1a0 tcsh (pause)
0xfffffc0005cd61f0 199 84 3 0xfffffc0000510528 sshd1 (select)
0xfffffc000237bc98 195 4086 3 0xfffffe00107461a0 tcsh (pause)
0xfffffc000237b6e0 192 84 3 0xfffffc0000510528 sshd1 (select)
0xfffffc000237bab0 189 4086 3 0xfffffe00107321a0 tcsh (pause)
0xfffffc000237ad58 187 84 3 0xfffffc0000510528 sshd1 (select)
0xfffffc000237a5b8 186 4086 3 0xfffffc000235a010 getty (ttyin)
0xfffffc000237b4f8 184 84 3 0xfffffc0000510528 sshd1 (select)
0xfffffc0005cd6008 175 4005 2 0 setiathome (biowait)
0xfffffc000237b8c8 163 84 3 0xfffffc0000510528 inetd (select)
0xfffffc000237b310 158 84 3 0xfffffe001072e1a0 xntpd (pause)
0xfffffc000237b128 156 184 3 0xfffffc00004ed864 cron (nanosleep)
0xfffffc000237a988 154 4 3 0xfffffe0000115220 update (lfs_dirops)
0xfffffc000237af40 126 84 3 0xfffffc0004a91750 mount_mfs (mfsidl)
0xfffffc000237ab70 122 84 3 0xfffffc0000510528 portmap (select)
0xfffffc000237a7a0 114 84 2 0 syslogd (select)
0xfffffc000237a3d0 3 20204 3 0xfffffc000050f790 reaper (reaper)
0xfffffc000237a1e8 2 20204 3 0xfffffc0000520588 pagedaemon (daemon_slp)
0xfffffc000237a000 1 4084 3 0xfffffc000237a000 init (wait)
0xfffffc000050d248 0 20204 3 0xfffffc000050d248 swapper (scheduler)
proc pid flag st wchan comm
(gdb) proc 0xfffffc0005cd6d60
(gdb) where
#0 0xfffffc0000337488 in mi_switch () at ../../../../kern/kern_synch.c:633
#1 0xfffffc0000336b00 in tsleep (ident=0x0, priority=8,
wmesg=0xfffffc0000497d1a "lfsinode", timo=0)
at ../../../../kern/kern_synch.c:371
#2 0xfffffc000032e228 in lockmgr (lkp=0xfffffc000b706180, flags=0,
interlkp=0x0) at ../../../../kern/kern_lock.c:349
#3 0xfffffc000041bdec in ufs_lock (v=0x0)
at ../../../../ufs/ufs/ufs_vnops.c:1585
#4 0xfffffc000036d024 in vn_lock (vp=0xfffffc0004a80388, flags=65538)
at ../../../../sys/vnode_if.h:724
#5 0xfffffc00003643b0 in vget (vp=0xfffffc0004a80388, flags=0)
at ../../../../kern/vfs_subr.c:872
#6 0xfffffc0000416fd4 in ufs_ihashget (dev=2054, inum=2)
at ../../../../ufs/ufs/ufs_ihash.c:111
#7 0xfffffc0000413740 in lfs_vget (mp=0xfffffe000014f800, ino=2,
vpp=0xfffffe0010761c58) at ../../../../ufs/lfs/lfs_vfsops.c:649
#8 0xfffffc000041912c in ufs_root (mp=0x0, vpp=0xfffffe0010761cc8)
at ../../../../ufs/ufs/ufs_vfsops.c:86
#9 0xfffffc0000362a14 in lookup (ndp=0xfffffe0010761db8)
at ../../../../kern/vfs_lookup.c:508
#10 0xfffffc0000362364 in namei (ndp=0xfffffe0010761db8)
at ../../../../kern/vfs_lookup.c:153
#11 0xfffffc00003675d0 in sys_statfs (p=0xfffffc0005cd6d60, v=0x0, retval=0x0)
at ../../../../kern/vfs_syscalls.c:590
#12 0xfffffc000047ffac in syscall (code=157, framep=0xfffffe0010761ef8)
at ../../../../arch/alpha/alpha/trap.c:664
#13 0xfffffc0000300540 in XentSys ()
at ../../../../arch/alpha/alpha/locore.s:581
warning: Hit heuristic-fence-post without finding
warning: enclosing function for address 0x1200036e8
This warning occurs if you are debugging a function without any symbols
(for example, in a stripped executable). In that case, you may wish to
increase the size of the search with the `set heuristic-fence-post' command.
Otherwise, you told GDB there was a function where there isn't one, or
(more likely) you have encountered a bug in GDB.
(gdb) proc 0xfffffc000237a988
(gdb) where
#0 0xfffffc0000337488 in mi_switch () at ../../../../kern/kern_synch.c:633
#1 0xfffffc0000336b00 in tsleep (ident=0x0, priority=17,
wmesg=0xfffffc0000498878 "lfs_dirops", timo=0)
at ../../../../kern/kern_synch.c:371
#2 0xfffffc0000413658 in lfs_sync (mp=0xfffffe000014f800, waitfor=2,
cred=0x0, p=0x0) at ../../../../ufs/lfs/lfs_vfsops.c:606
#3 0xfffffc0000367450 in sys_sync (p=0xfffffc000237a988, v=0x0, retval=0x0)
at ../../../../kern/vfs_syscalls.c:524
#4 0xfffffc000047ffac in syscall (code=36, framep=0xfffffe0010725ef8)
at ../../../../arch/alpha/alpha/trap.c:664
#5 0xfffffc0000300540 in XentSys ()
at ../../../../arch/alpha/alpha/locore.s:581
warning: Hit heuristic-fence-post without finding
warning: enclosing function for address 0x120000f98
(gdb) proc 0xfffffc000e66c5d0
(gdb) where
#0 0xfffffc0000337488 in mi_switch () at ../../../../kern/kern_synch.c:633
#1 0xfffffc0000336b00 in tsleep (ident=0x0, priority=8,
wmesg=0xfffffc0000497d1a "lfsinode", timo=0)
at ../../../../kern/kern_synch.c:371
#2 0xfffffc000032e228 in lockmgr (lkp=0xfffffc000ddefe70, flags=0,
interlkp=0x0) at ../../../../kern/kern_lock.c:349
#3 0xfffffc000041bdec in ufs_lock (v=0x0)
at ../../../../ufs/ufs/ufs_vnops.c:1585
#4 0xfffffc000036d024 in vn_lock (vp=0xfffffc00055402c8, flags=65538)
at ../../../../sys/vnode_if.h:724
#5 0xfffffc00003643b0 in vget (vp=0xfffffc00055402c8, flags=0)
at ../../../../kern/vfs_subr.c:872
#6 0xfffffc0000416fd4 in ufs_ihashget (dev=2054, inum=65024)
at ../../../../ufs/ufs/ufs_ihash.c:111
#7 0xfffffc0000413740 in lfs_vget (mp=0xfffffe000014f800, ino=65024,
vpp=0xfffffe00107c9b58) at ../../../../ufs/lfs/lfs_vfsops.c:649
#8 0xfffffc00004181c4 in ufs_lookup (v=0x0)
at ../../../../ufs/ufs/ufs_lookup.c:607
#9 0xfffffc0000362898 in lookup (ndp=0xfffffe00107c9db8)
at ../../../../sys/vnode_if.h:71
#10 0xfffffc0000362364 in namei (ndp=0xfffffe00107c9db8)
at ../../../../kern/vfs_lookup.c:153
#11 0xfffffc0000369a0c in sys___lstat13 (p=0xfffffc000e66c5d0, v=0x0,
retval=0x0) at ../../../../kern/vfs_syscalls.c:1603
#12 0xfffffc000047ffac in syscall (code=280, framep=0xfffffe00107c9ef8)
at ../../../../arch/alpha/alpha/trap.c:664
#13 0xfffffc0000300540 in XentSys ()
at ../../../../arch/alpha/alpha/locore.s:581
warning: Hit heuristic-fence-post without finding
warning: enclosing function for address 0x16025d574
(gdb) proc 0xfffffc0009b903e0
(gdb) where
#0 0xfffffc0000337488 in mi_switch () at ../../../../kern/kern_synch.c:633
#1 0xfffffc0000336b00 in tsleep (ident=0x0, priority=8,
wmesg=0xfffffc0000497d1a "lfsinode", timo=0)
at ../../../../kern/kern_synch.c:371
#2 0xfffffc000032e228 in lockmgr (lkp=0xfffffc000ddefe70, flags=0,
interlkp=0x0) at ../../../../kern/kern_lock.c:349
#3 0xfffffc000041bdec in ufs_lock (v=0x0)
at ../../../../ufs/ufs/ufs_vnops.c:1585
#4 0xfffffc000036d024 in vn_lock (vp=0xfffffc00055402c8, flags=131074)
at ../../../../sys/vnode_if.h:724
#5 0xfffffc000036cafc in vn_write (fp=0x0, offset=0xfffffc00023e3918,
uio=0xfffffe00107b1dd8, cred=0xfffffe000014bb00, flags=1)
at ../../../../kern/vfs_vnops.c:399
#6 0xfffffc000034258c in dofilewrite (p=0xfffffc0009b903e0, fd=3, fp=0x0,
buf=0x0, nbyte=0, offset=0x0, flags=1, retval=0xfffffe00107b1ed8)
at ../../../../kern/sys_generic.c:337
#7 0xfffffc00003424b0 in sys_write (p=0x0, v=0x0, retval=0x0)
at ../../../../kern/sys_generic.c:290
#8 0xfffffc000047ffac in syscall (code=4, framep=0xfffffe00107b1ef8)
at ../../../../arch/alpha/alpha/trap.c:664
#9 0xfffffc0000300540 in XentSys ()
at ../../../../arch/alpha/alpha/locore.s:581
warning: Hit heuristic-fence-post without finding
warning: enclosing function for address 0x120002654
(gdb)
Further details can be dug out from this crash dump; I'll hang
on to it for a while.
>How-To-Repeat:
This is a continuation of the test setup used when I reported
PR#9149. To recap:
o One loop doing "bonnie -s 2000"
o One loop doing unpacking and removal of the entire NetBSD
1.4.2 source distribution.
The bonnie loop had been running for a little more than 52
hours, and the unpack/remove loop for 8 hours, when the LFS
file system deadlocked at 00:40 in the morning while
/etc/daily was running.
>Fix:
Sorry, don't know.
>Audit-Trail:
>Unformatted: