Subject: kern/8863: filesystem lookup under heavy LFS write
To: None <gnats-bugs@gnats.netbsd.org>
From: Manuel Bouyer <bouyer@antioche.lip6.fr>
List: netbsd-bugs
Date: 11/23/1999 12:31:13
>Number:         8863
>Category:       kern
>Synopsis:       filesystem lookup under heavy LFS write
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Nov 23 12:30:00 1999
>Last-Modified:
>Originator:     Manuel Bouyer
>Organization:

none

>Release:        -current as of yesterday
>Environment:
	
System: NetBSD rochebonne.antioche.eu.org 1.4P NetBSD 1.4P (ROCHEBONNE) #0: Mon Nov 22 20:33:51 MET 1999 bouyer@rochebonne.antioche.eu.org:/usr/src/1.4M/src/sys/arch/i386/compile/ROCHEBONNE i386

rochebonne#/home/bouyer>mount
/dev/wd0a on / type ffs (NFS exported, local)
/dev/wd0e on /usr type ffs (NFS exported, local)
/dev/sd0g on /home type ffs (NFS exported, local)
/dev/wd0g on /usr/src type lfs (local)
mfs:161 on /tmp type mfs (asynchronous, local)
pid192@rochebonne:/auto on /auto type nfs
/dev/sd1d on /magn type ffs (asynchronous, NFS exported, local, read-only)

This system is configured with a large buffer cache:
total memory = 65148 KB
avail memory = 41364 KB
using 2822 buffers containing 20000 KB of memory
kernel compiled without SOFTDEP.

wd0 is a realy fast HD (> 22MB/s on the raw device), the CPU (Cyrix 133Mhz)
has difficulties keeping up.

>Description:
	
	After a frech newfs_lfs of /usr/src (without any special options)
	I was running 'zcat /magn/1.4P.tgz | tar xf -' in /usr/src (1.4P.tgz
	being the sources tarballs of today's -current). 'systat vm' shows
	that the CPU is busy (less than 1% idle), and spending more than
	50% in the kernel. wd0 gets about 2MB/s, disk is clearly not the
	bottleneck. 'top' show zcat, tar and lfs_cleanerd being the processes
	eating CPU (lfs_cleanerd for about 1%, but constantly running).

	after a while all activity stopped, any access to the filesystem
	hangs the process. I've been able to enter the debugger and get a
	core dump. Following are infos I've been able to get from it.

rochebonne#/var/crash>ps -M netbsd.3.core -axl
  UID   PID  PPID CPU PRI NI   VSZ  RSS WCHAN  STAT TT       TIME COMMAND
    0     0   173   0 -18  0     0    0 schedu DLs  ??    0:00.01 (swapper)
    0     1   173   0  10  0   236    0 wait   Is   ??    0:00.03 (init)
    0     2   173   0 -18  0     0    0 daemon DL   ??    0:00.01 (pagedaemon)
    0     3   173   0 -18  0     0    0 reaper DL   ??    0:00.03 (reaper)
    0     4   173   0  -5  0     0    0 lfs_di DL   ??    0:00.10 (ioflush)
    0   150   173   0   2  0    92    0 -      Rs   ??    0:00.10 (syslogd)
    0   154   173  25   2  0    60    0 select Is   ??    0:00.05 (portmap)
    0   161   173   0  10  0 20936    0 mfsidl Is   ??    0:00.03 (mount_mfs)
    0   165   173   0   2  0   116    0 select Is   ??    0:00.17 (mountd)
    0   167   173  18   2  0    40    0 netcon Is   ??    0:00.02 (nfsd)
    0   171   173  25  10  0    16    0 nfsidl IL   ??    0:00.00 (nfsiod)
    0   172   173  25  10  0    16    0 nfsidl IL   ??    0:00.00 (nfsiod)
    0   173   173  25  10  0    16    0 nfsidl IL   ??    0:00.00 (nfsiod)
    0   174   173  25  10  0    16    0 nfsidl IL   ??    0:00.00 (nfsiod)
    0   177   173  25   2  0    20    0 nfsd   IL   ??    0:00.01 (nfsd)
    0   178   173  25   2  0    20    0 nfsd   IL   ??    0:00.01 (nfsd)
    0   179   173  25   2  0    20    0 nfsd   IL   ??    0:00.01 (nfsd)
    0   181   173  25   2  0    20    0 nfsd   IL   ??    0:00.00 (nfsd)
    0   192   173   0   2  0   260    0 select Ss   ??    0:00.02 (amd)
    0   219   173   0 -14  0   220    0 ufs_ha Ds   ??    0:00.02 (cron)
    0   222   173  25   2  0   212    0 select Is   ??    0:00.01 (dhcpd)
    0   225   173  25   2  0    64    0 select Is   ??    0:00.01 (inetd)
    0   231   173  19   2  0   240    0 select Is   ??    0:02.89 (sshd1)
    0   257   173   0  10  0    44    0 wait   Is   ??    0:00.00 (lfs_cleanerd
    0   258   173   0  28  0   576    0 segmen I    ??    0:01.73 (lfs_cleanerd
 1210   233   173   0  18  0   524    0 pause  Is   E0    0:00.22 (tcsh)
 1210   266   173  20   2  0   500    0 netio  I+   E0    0:47.33 (zcat)
 1210   267   173  27  -5  0   588    0 lfs wr D+   E0    1:00.80 (tar)
    0   234   173   2  18  0   328    0 pause  Is   E1    0:00.12 (csh)
    0   245   173   0  18  0   548    0 pause  I    E1    0:00.16 (tcsh)
    0   265   173   0   3  0   292    0 ttyin  S+   E1    0:00.46 (systat)
    0   235   173   0  18  0   328    0 pause  Is   E2    0:00.12 (csh)
    0   270   173   0   2  0   204    0 select S+   E2    0:00.56 (top)
    0   236   173  17   3  0    48    0 ttyin  Is+  E3    0:00.03 (getty)
    0   237   173  17   3  0    48    0 ttyin  Is+  E4    0:00.03 (getty)
    0   238   173  17   3  0    48    0 ttyin  Is+  E5    0:00.03 (getty)
    0   239   173  17   3  0    48    0 ttyin  Is+  E6    0:00.03 (getty)

Note that tar, lfs_cleanerd and ioflush are all waiting in lfs code.
Other hung processes seems to wait on ufs_halock (verified in another case
I reproduced).

stack trace for lfs_cleanerd (pid 258):
(gdb) proc 0xfdf78c80
(gdb) where
#0  mi_switch () at ../../../../kern/kern_synch.c:748
#1  0xf0136495 in bpendtsleep () at ../../../../kern/kern_synch.c:415
#2  0xf01d9582 in sys_lfs_segwait (p=0xfdf78c80, v=0xfe062f88, 
    retval=0xfe062f80) at ../../../../ufs/lfs/lfs_syscalls.c:862
#3  0xf01fa86e in syscall (frame={tf_es = 134479903, tf_ds = -272695265, 
      tf_edi = -272638948, tf_esi = -272638956, tf_ebp = -272638936, 
      tf_ebx = 134549504, tf_edx = 1835, tf_ecx = 3, tf_eax = 187, 
      tf_trapno = 3, tf_err = 2, tf_eip = 1208583703, tf_cs = 23, 
      tf_eflags = 663, tf_esp = -272638996, tf_ss = 31, tf_vm86_es = 0, 
      tf_vm86_ds = 0, tf_vm86_fs = 0, tf_vm86_gs = 0})
    at ../../../../arch/i386/i386/trap.c:760
#4  0xf0100d7d in syscall1 ()
can not access 0xefbfdc28, invalid translation (invalid PDE)
can not access 0xefbfdc28, invalid translation (invalid PDE)
Cannot access memory at address 0xefbfdc28.

stack trace for tar (pid 267):
#0  mi_switch () at ../../../../kern/kern_synch.c:748
#1  0xf0136495 in bpendtsleep () at ../../../../kern/kern_synch.c:415
#2  0xf01d64ff in lfs_segwrite (mp=0xf04d4000, flags=5)
    at ../../../../ufs/lfs/lfs_segment.c:501
#3  0xf01d611c in lfs_vflush (vp=0xfe1dd700)
    at ../../../../ufs/lfs/lfs_segment.c:253
#4  0xf01d4ae3 in lfs_update (v=0xfe0beac4)
    at ../../../../ufs/lfs/lfs_inode.c:178
#5  0xf01db0d3 in lfs_fsync (v=0xfe0beb04) at ../../../../sys/vnode_if.h:1027
#6  0xf01542e7 in vinvalbuf (vp=0xfe1dd700, flags=1, cred=0xffffffff, 
    p=0xfe096c88, slpflag=0, slptimeo=0) at ../../../../sys/vnode_if.h:449
#7  0xf0154ea5 in vclean (vp=0xfe1dd700, flags=8, p=0xfe096c88)
    at ../../../../kern/vfs_subr.c:1349
#8  0xf0155083 in vgonel (vp=0xfe1dd700, p=0xfe096c88)
    at ../../../../kern/vfs_subr.c:1479
#9  0xf0154026 in getnewvnode (tag=VT_LFS, mp=0xf04d4000, vops=0xf0451c00, 
    vpp=0xfe0bec30) at ../../../../kern/vfs_subr.c:503
#10 0xf01d367e in lfs_vcreate (mp=0xf04d4000, ino=33107, vpp=0xfe0bec30)
    at ../../../../ufs/lfs/lfs_alloc.c:255
#11 0xf01d357b in lfs_valloc (v=0xfe0bec90)
    at ../../../../ufs/lfs/lfs_alloc.c:201
#12 0xf01e1792 in ufs_makeinode (mode=33188, dvp=0xfe1db09c, vpp=0xfe0beee8, 
    cnp=0xfe0beefc) at ../../../../sys/vnode_if.h:933
#13 0xf01deb7a in ufs_create (v=0xfe0bee08)
    at ../../../../ufs/ufs/ufs_vnops.c:118
#14 0xf01db462 in lfs_create (v=0xfe0bee08)
    at ../../../../ufs/lfs/lfs_vnops.c:449
#15 0xf015ae28 in vn_open (ndp=0xfe0beed8, fmode=1550, cmode=420)
    at ../../../../sys/vnode_if.h:96
#16 0xf01573a0 in sys_open (p=0xfe096c88, v=0xfe0bef88, retval=0xfe0bef80)
    at ../../../../kern/vfs_syscalls.c:999
#17 0xf01fa86e in syscall (frame={tf_es = 31, tf_ds = 31, tf_edi = 0, 
      tf_esi = 134813696, tf_ebp = -272639140, tf_ebx = 1549, tf_edx = -1, 
      tf_ecx = 2, tf_eax = 5, tf_trapno = 3, tf_err = 2, tf_eip = 134684535, 
      tf_cs = 23, tf_eflags = 514, tf_esp = -272639504, tf_ss = 31, 
      tf_vm86_es = 0, tf_vm86_ds = 0, tf_vm86_fs = 0, tf_vm86_gs = 0})
    at ../../../../arch/i386/i386/trap.c:760
#18 0xf0100d7d in syscall1 ()
can not access 0xefbfdb5c, invalid translation (invalid PDE)
can not access 0xefbfdb5c, invalid translation (invalid PDE)
Cannot access memory at address 0xefbfdb5c.

stack trace for ioflush (pid 4):
#0  mi_switch () at ../../../../kern/kern_synch.c:748
#1  0xf0136495 in bpendtsleep () at ../../../../kern/kern_synch.c:415
#2  0xf01da5ab in lfs_sync (mp=0xf04d4000, waitfor=3, cred=0xf044cf80, 
    p=0xfdf783c0) at ../../../../ufs/lfs/lfs_vfsops.c:616
#3  0xf015e957 in sync_fsync (v=0xfdf84f80)
    at ../../../../miscfs/syncfs/sync_vnops.c:136
#4  0xf015ec61 in sched_sync (v=0x0) at ../../../../sys/vnode_if.h:449
#5  0xf0100327 in proc_trampoline ()
#6  0xf02c6ed8 in ?? ()


>How-To-Repeat:
	
	On my system, untarring a large source tree to a LFS filesystem
	will reproduce it reliably, after variable time (but always before
	extract is done).

>Fix:
	Sounds like a deadlock condition between lfs_sync() and another part of
	the LFS code. Unfortunably I can track this down better for now
	(doing so would require for me to understand the LFS code first which
	is require more time than I can spend on it) but I can test
	any patch or provide any additionnal informations.
	I can easily reproduce the problem.
>Audit-Trail:
>Unformatted: