Subject: kern/8863: filesystem lookup under heavy LFS write
To: None <gnats-bugs@gnats.netbsd.org>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: netbsd-bugs
Date: 11/23/1999 12:31:13
>Number: 8863
>Category: kern
>Synopsis: filesystem lookup under heavy LFS write
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people (Kernel Bug People)
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Nov 23 12:30:00 1999
>Last-Modified:
>Originator: Manuel Bouyer
>Organization:
none
>Release: -current as of yesterday
>Environment:
System: NetBSD rochebonne.antioche.eu.org 1.4P NetBSD 1.4P (ROCHEBONNE) #0: Mon Nov 22 20:33:51 MET 1999 bouyer@rochebonne.antioche.eu.org:/usr/src/1.4M/src/sys/arch/i386/compile/ROCHEBONNE i386
rochebonne#/home/bouyer>mount
/dev/wd0a on / type ffs (NFS exported, local)
/dev/wd0e on /usr type ffs (NFS exported, local)
/dev/sd0g on /home type ffs (NFS exported, local)
/dev/wd0g on /usr/src type lfs (local)
mfs:161 on /tmp type mfs (asynchronous, local)
pid192@rochebonne:/auto on /auto type nfs
/dev/sd1d on /magn type ffs (asynchronous, NFS exported, local, read-only)
This system is configured with a large buffer cache:
total memory = 65148 KB
avail memory = 41364 KB
using 2822 buffers containing 20000 KB of memory
kernel compiled without SOFTDEP.
wd0 is a realy fast HD (> 22MB/s on the raw device), the CPU (Cyrix 133Mhz)
has difficulties keeping up.
>Description:
After a frech newfs_lfs of /usr/src (without any special options)
I was running 'zcat /magn/1.4P.tgz | tar xf -' in /usr/src (1.4P.tgz
being the sources tarballs of today's -current). 'systat vm' shows
that the CPU is busy (less than 1% idle), and spending more than
50% in the kernel. wd0 gets about 2MB/s, disk is clearly not the
bottleneck. 'top' show zcat, tar and lfs_cleanerd being the processes
eating CPU (lfs_cleanerd for about 1%, but constantly running).
after a while all activity stopped, any access to the filesystem
hangs the process. I've been able to enter the debugger and get a
core dump. Following are infos I've been able to get from it.
rochebonne#/var/crash>ps -M netbsd.3.core -axl
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
0 0 173 0 -18 0 0 0 schedu DLs ?? 0:00.01 (swapper)
0 1 173 0 10 0 236 0 wait Is ?? 0:00.03 (init)
0 2 173 0 -18 0 0 0 daemon DL ?? 0:00.01 (pagedaemon)
0 3 173 0 -18 0 0 0 reaper DL ?? 0:00.03 (reaper)
0 4 173 0 -5 0 0 0 lfs_di DL ?? 0:00.10 (ioflush)
0 150 173 0 2 0 92 0 - Rs ?? 0:00.10 (syslogd)
0 154 173 25 2 0 60 0 select Is ?? 0:00.05 (portmap)
0 161 173 0 10 0 20936 0 mfsidl Is ?? 0:00.03 (mount_mfs)
0 165 173 0 2 0 116 0 select Is ?? 0:00.17 (mountd)
0 167 173 18 2 0 40 0 netcon Is ?? 0:00.02 (nfsd)
0 171 173 25 10 0 16 0 nfsidl IL ?? 0:00.00 (nfsiod)
0 172 173 25 10 0 16 0 nfsidl IL ?? 0:00.00 (nfsiod)
0 173 173 25 10 0 16 0 nfsidl IL ?? 0:00.00 (nfsiod)
0 174 173 25 10 0 16 0 nfsidl IL ?? 0:00.00 (nfsiod)
0 177 173 25 2 0 20 0 nfsd IL ?? 0:00.01 (nfsd)
0 178 173 25 2 0 20 0 nfsd IL ?? 0:00.01 (nfsd)
0 179 173 25 2 0 20 0 nfsd IL ?? 0:00.01 (nfsd)
0 181 173 25 2 0 20 0 nfsd IL ?? 0:00.00 (nfsd)
0 192 173 0 2 0 260 0 select Ss ?? 0:00.02 (amd)
0 219 173 0 -14 0 220 0 ufs_ha Ds ?? 0:00.02 (cron)
0 222 173 25 2 0 212 0 select Is ?? 0:00.01 (dhcpd)
0 225 173 25 2 0 64 0 select Is ?? 0:00.01 (inetd)
0 231 173 19 2 0 240 0 select Is ?? 0:02.89 (sshd1)
0 257 173 0 10 0 44 0 wait Is ?? 0:00.00 (lfs_cleanerd
0 258 173 0 28 0 576 0 segmen I ?? 0:01.73 (lfs_cleanerd
1210 233 173 0 18 0 524 0 pause Is E0 0:00.22 (tcsh)
1210 266 173 20 2 0 500 0 netio I+ E0 0:47.33 (zcat)
1210 267 173 27 -5 0 588 0 lfs wr D+ E0 1:00.80 (tar)
0 234 173 2 18 0 328 0 pause Is E1 0:00.12 (csh)
0 245 173 0 18 0 548 0 pause I E1 0:00.16 (tcsh)
0 265 173 0 3 0 292 0 ttyin S+ E1 0:00.46 (systat)
0 235 173 0 18 0 328 0 pause Is E2 0:00.12 (csh)
0 270 173 0 2 0 204 0 select S+ E2 0:00.56 (top)
0 236 173 17 3 0 48 0 ttyin Is+ E3 0:00.03 (getty)
0 237 173 17 3 0 48 0 ttyin Is+ E4 0:00.03 (getty)
0 238 173 17 3 0 48 0 ttyin Is+ E5 0:00.03 (getty)
0 239 173 17 3 0 48 0 ttyin Is+ E6 0:00.03 (getty)
Note that tar, lfs_cleanerd and ioflush are all waiting in lfs code.
Other hung processes seems to wait on ufs_halock (verified in another case
I reproduced).
stack trace for lfs_cleanerd (pid 258):
(gdb) proc 0xfdf78c80
(gdb) where
#0 mi_switch () at ../../../../kern/kern_synch.c:748
#1 0xf0136495 in bpendtsleep () at ../../../../kern/kern_synch.c:415
#2 0xf01d9582 in sys_lfs_segwait (p=0xfdf78c80, v=0xfe062f88,
retval=0xfe062f80) at ../../../../ufs/lfs/lfs_syscalls.c:862
#3 0xf01fa86e in syscall (frame={tf_es = 134479903, tf_ds = -272695265,
tf_edi = -272638948, tf_esi = -272638956, tf_ebp = -272638936,
tf_ebx = 134549504, tf_edx = 1835, tf_ecx = 3, tf_eax = 187,
tf_trapno = 3, tf_err = 2, tf_eip = 1208583703, tf_cs = 23,
tf_eflags = 663, tf_esp = -272638996, tf_ss = 31, tf_vm86_es = 0,
tf_vm86_ds = 0, tf_vm86_fs = 0, tf_vm86_gs = 0})
at ../../../../arch/i386/i386/trap.c:760
#4 0xf0100d7d in syscall1 ()
can not access 0xefbfdc28, invalid translation (invalid PDE)
can not access 0xefbfdc28, invalid translation (invalid PDE)
Cannot access memory at address 0xefbfdc28.
stack trace for tar (pid 267):
#0 mi_switch () at ../../../../kern/kern_synch.c:748
#1 0xf0136495 in bpendtsleep () at ../../../../kern/kern_synch.c:415
#2 0xf01d64ff in lfs_segwrite (mp=0xf04d4000, flags=5)
at ../../../../ufs/lfs/lfs_segment.c:501
#3 0xf01d611c in lfs_vflush (vp=0xfe1dd700)
at ../../../../ufs/lfs/lfs_segment.c:253
#4 0xf01d4ae3 in lfs_update (v=0xfe0beac4)
at ../../../../ufs/lfs/lfs_inode.c:178
#5 0xf01db0d3 in lfs_fsync (v=0xfe0beb04) at ../../../../sys/vnode_if.h:1027
#6 0xf01542e7 in vinvalbuf (vp=0xfe1dd700, flags=1, cred=0xffffffff,
p=0xfe096c88, slpflag=0, slptimeo=0) at ../../../../sys/vnode_if.h:449
#7 0xf0154ea5 in vclean (vp=0xfe1dd700, flags=8, p=0xfe096c88)
at ../../../../kern/vfs_subr.c:1349
#8 0xf0155083 in vgonel (vp=0xfe1dd700, p=0xfe096c88)
at ../../../../kern/vfs_subr.c:1479
#9 0xf0154026 in getnewvnode (tag=VT_LFS, mp=0xf04d4000, vops=0xf0451c00,
vpp=0xfe0bec30) at ../../../../kern/vfs_subr.c:503
#10 0xf01d367e in lfs_vcreate (mp=0xf04d4000, ino=33107, vpp=0xfe0bec30)
at ../../../../ufs/lfs/lfs_alloc.c:255
#11 0xf01d357b in lfs_valloc (v=0xfe0bec90)
at ../../../../ufs/lfs/lfs_alloc.c:201
#12 0xf01e1792 in ufs_makeinode (mode=33188, dvp=0xfe1db09c, vpp=0xfe0beee8,
cnp=0xfe0beefc) at ../../../../sys/vnode_if.h:933
#13 0xf01deb7a in ufs_create (v=0xfe0bee08)
at ../../../../ufs/ufs/ufs_vnops.c:118
#14 0xf01db462 in lfs_create (v=0xfe0bee08)
at ../../../../ufs/lfs/lfs_vnops.c:449
#15 0xf015ae28 in vn_open (ndp=0xfe0beed8, fmode=1550, cmode=420)
at ../../../../sys/vnode_if.h:96
#16 0xf01573a0 in sys_open (p=0xfe096c88, v=0xfe0bef88, retval=0xfe0bef80)
at ../../../../kern/vfs_syscalls.c:999
#17 0xf01fa86e in syscall (frame={tf_es = 31, tf_ds = 31, tf_edi = 0,
tf_esi = 134813696, tf_ebp = -272639140, tf_ebx = 1549, tf_edx = -1,
tf_ecx = 2, tf_eax = 5, tf_trapno = 3, tf_err = 2, tf_eip = 134684535,
tf_cs = 23, tf_eflags = 514, tf_esp = -272639504, tf_ss = 31,
tf_vm86_es = 0, tf_vm86_ds = 0, tf_vm86_fs = 0, tf_vm86_gs = 0})
at ../../../../arch/i386/i386/trap.c:760
#18 0xf0100d7d in syscall1 ()
can not access 0xefbfdb5c, invalid translation (invalid PDE)
can not access 0xefbfdb5c, invalid translation (invalid PDE)
Cannot access memory at address 0xefbfdb5c.
stack trace for ioflush (pid 4):
#0 mi_switch () at ../../../../kern/kern_synch.c:748
#1 0xf0136495 in bpendtsleep () at ../../../../kern/kern_synch.c:415
#2 0xf01da5ab in lfs_sync (mp=0xf04d4000, waitfor=3, cred=0xf044cf80,
p=0xfdf783c0) at ../../../../ufs/lfs/lfs_vfsops.c:616
#3 0xf015e957 in sync_fsync (v=0xfdf84f80)
at ../../../../miscfs/syncfs/sync_vnops.c:136
#4 0xf015ec61 in sched_sync (v=0x0) at ../../../../sys/vnode_if.h:449
#5 0xf0100327 in proc_trampoline ()
#6 0xf02c6ed8 in ?? ()
>How-To-Repeat:
On my system, untarring a large source tree to a LFS filesystem
will reproduce it reliably, after variable time (but always before
extract is done).
>Fix:
Sounds like a deadlock condition between lfs_sync() and another part of
the LFS code. Unfortunably I can track this down better for now
(doing so would require for me to understand the LFS code first which
is require more time than I can spend on it) but I can test
any patch or provide any additionnal informations.
I can easily reproduce the problem.
>Audit-Trail:
>Unformatted: