Subject: kern/1354: union FS can deadlock while vgone()/vclean() is running
To: None <gnats-bugs@gnats.netbsd.org, jsp@sequent.com>
From: John Kohl <jtk@kolvir.arlington.ma.us>
List: netbsd-bugs
Date: 08/13/1995 20:42:57
>Number:         1354
>Category:       kern
>Synopsis:       union FS can deadlock while vgone()/vclean() is running
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Aug 13 20:50:02 1995
>Last-Modified:
>Originator:     John Kohl
>Organization:
NetBSD Kernel Hackers `R` Us
>Release:        NetBSD-current as of 12 Aug 1995
>Environment:
	
System: NetBSD pattern 1.0A NetBSD 1.0A (PATTERN) #67: Sat Aug 12 23:17:42 EDT 1995 jtk@pattern:/u1/NetBSD-current/src/sys/arch/i386/compile/PATTERN i386


>Description:

I was doing a kernel build (makedepend) and a 'make install' in
/usr/src/include, and got the union file system deadlocked on itself.

/usr/src is a union mount; the top layer is UFS and the bottom layer
is NFS.

Here are kernel stack traces of the deadlocked processes, plus some
analysis and vnode printouts:

The processes (ddb output):

 2585 f882a600 f9e5c000   0     1  2190  004006  3  netbsd  find   f87b8e00
 2577 f880cf00 f9e56000 5509     1  2135  004006  3  netbsd  cpp   vget f8798500

process 2577's trace follows.  It is waiting in vget for VXWANT to
clear, i.e. the vnode to finish being recycled.  Note that in the
printout of the union vnode, 2577 has the lock on the upper layer node
( because it was locked in union_lookup1() ).  2577 was looking up
"../../../../net/radix.h, and was in the process of looking up the "net"
component.

2585 is waiting for the top-layer (UFS) node lock.  
Although not shown here, the union node is recorded as locked by PID
2585, with flags 0x12 == UN_LOCKED|UN_CACHED.

db> trace 0xf9e57bf4
bpendtsleep(f8826600,0,f8807980,0,f9e57c30) at bpendtsleep
bpendtsleep(f8798500,8,f812dd7d,0,f8826600) at bpendtsleep
_vget(f8798500,0) at _vget+0x35
_union_allocvp(f9e57ef4,f877d200,f87bed80,f87c4e80,f9e57f08) at _union_allocvp+
0x120
_union_lookup(f9e57d74,f880ab80,f9e57f08,f9e57ee4,f87a5820) at _union_lookup+0x
493
_lookup(f9e57ee4,0,1d0e0,f9e57ee4,f9e57e14) at _lookup+0x257
_namei(f9e57ee4,0,1d0e0,f8843540,f8847000) at _namei+0x16b
_vn_open(f9e57ee4,1,1a4,f7bfd0a4,f880cf00) at _vn_open+0x17c
_open(f880cf00,f9e57f84,f9e57f7c,0,1d0e0) at _open+0x91
_syscall() at _syscall+0x239
--- syscall (number 5) ---
db> call vprint(0, 0xf8798500)
type VDIR, usecount 0, writecount 0, refcount 0, flags (VXLOCK|VXWANT)
                tag VT_UNION, vp=f8798500, uppervp=f8770700, lowervp=f8807980
uppervp: type VDIR, usecount 2, writecount 0, refcount 0,
        tag VT_UFS, ino 80688, on dev 4, 5 (LOCKED)
        owner pid 2577 waiting pid 2585
lowervp: type VDIR, usecount 2, writecount 0, refcount 0,
        tag VT_NFS, fileid 176651 fsid 0x1b01

process 2585's trace follows. It was doing a lookup in a different
filesystem (/usr) and decided to recycle a union vnode.  Inside
vclean(), there's a call to VOP_LOCK(vp) before vinvalbuf() is called.
This is consistent with the notation in the union node that 2585 owns
the union node lock.  However, 2585 does NOT have the top-layer lock,
because inside union_lock() there's this code:

		if (((un->un_flags & UN_ULOCK) == 0) &&
		    (vp->v_usecount != 0)) {
			VOP_LOCK(un->un_uppervp);
			un->un_flags |= UN_ULOCK;
		}

usecount is indeed 0 in this case--that's why it's on the freelist.

db> trace *0xf9e5c03c
bpendsleep(f87b8e00,f8770700,f9e5db60,f8187626,f87b8e00) at bpendsleep
bpendsleep(f87b8e00,8) at bpendsleep
_ufs_lock(f9e5db70,f8770700,f81d36ec,f8770700,f9e5dbb0) at _ufs_lock+0x22
_union_vptofh(f8826600) at _union_vptofh+0x35    << really union_fixup
_union_fsync(f9e5dbcc,f8798500,f9e5dc24,0,2) at _union_fsync+0x64
_vinvalbuf(f8798500,1,ffffffff,0,0,0) at _vinvalbuf+0x46
_vclean(f8798500,8) at _vclean+0x6d
_vgone(f8798500) at _vgone+0x34
_getnewvnode(1,f876a000,f8726100,f9e5dcc0,f92b702c) at _getnewvnode+0xdb
_ffs_vget(f876a000,1004,f9e5dd44,f87be780,f9e5de60) at _ffs_vget+0x4f
_ufs_lookup(f9e5dda8,f87be780,f9e5de60,f9e5de3c,f87dcc20) at _ufs_lookup+0xbb1
_lookup(f9e5de3c,f7bfd920,f882a600,8,0) at _lookup+0x257
_namei(f9e5de3c,f7bfd920,f882a600,8,14c40) at _namei+0x16b
_lstat(f882a600,f9e5df84,f9e5df7c,0,14c4c) at _lstat+0x4a
_syscall() at _syscall+0x239

The basic scenario is this, I think:

2577 starts looking up a name in the union FS.  It looks up the top
layer name, finds it's a directory, and takes the lock on it.
It then starts looking up the lower name, and blocks (maybe waiting for
the lock on the lowerdvp).

2585 calls getnewvnode(), and starts reclaim on the union directory node
for the name that 2577 is about to use.  It calls VOP_LOCK() which sets
the locked bits in the union layer but doesn't take the uppervp lock.
It then calls vinvalbuf() which calls VOP_FSYNC() which sees the missing
uppervp lock and tries to get the lock.  It blocks because 2577 has the
lock.

2577 then finishes its lookup in the lower file system.  It then calls
union_allocvp() while holding uppervp locked.  union_allocvp() finds the
union node on the hash chains, and tries to vget() the node.  vget
notices the 0xdeadb marker in the vnode free chain, and waits for the
vnode to be finished recycling.

voila, 2-process deadly embrace.

I'm guessing that the (vp->v_usecount != 0) check is inside union_lock()
to avoid a related deadlock--the vclean() process would mark the vnode
for cleaning but then block in VOP_LOCK() waiting for the uppervp lock,
while the union_allocvp() process would be in vget() waiting for the
vnode recycling to be completed.

I think something needs to change in the way union_allocvp() calls
vget.  It really can't hold the uppervp lock while calling vget on the
union node (because that can lead to deadlock as seen in this case), nor
can it comfortably drop/regain the uppervp lock (although it only needs
to do this for the cases where the union node is on the free list).

>How-To-Repeat:
be very unlucky, or just try to run several processes at once doing
builds inside a union FS.

>Fix:

>Audit-Trail:
>Unformatted: