Subject: kern/616: NFS and union fs over SLIP causes hangs
To: None <gnats-admin@sun-lamp.cs.berkeley.edu>
From: Brad Spencer <brad@anduin.eldar.org>
List: netbsd-bugs
Date: 12/06/1994 20:20:04
>Number:         616
>Category:       kern
>Synopsis:       union mounting over an NFS mount point, over a SLIP link, causes a hang
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Dec  6 20:20:02 1994
>Originator:     Brad Spencer
>Organization:
"	Right at home"
>Release:        1.0
>Environment:
	P90, NetBSD 1.0 with patches 1 to 5, i386, standard libraries
System: NetBSD anduin.eldar.org 1.0 NetBSD 1.0 (ANDUIN) #1: Mon Nov 28 22:12:04 EST 1994 brad@anduin.eldar.org:/usr/src/sys/arch/i386/compile/ANDUIN i386


>Description:
	In order to test the usability of a 9600 baud SLIP link I took
	it upon myself to compile X11R6 via NFS over said link.  I
	issued the following command:

mount -r -t nfs -o -a1,-r1024,-w1024,-i	bigmachine:/usr/local/src/X11R6 /mnt

	In order to improve performance a bit, I union mount a local
	directory on top of /mnt with the following command:

mount -t union /usr/local/tmp/X11R6/build /mnt

	To watch the progress of things, I do a 'stty status ^T' in
	all the virtual consoles involved in the compilation.

	The problem is that at somewhat random times, the X11R6 make
	will hang in 'vget', according to the status returned by a ^T.
	The trigger seems to be a burst of disk accesses in quick
	succession.  These accesses could be relaynews processing
	incoming mail, GNUS reading the headers from a news group, a
	'find' or 'grep' looking for something, a tar archive being
	unpacked, etc.  Not every instance will cause the hang.  In
	any case, the process that causes the make to hang in 'vget'
	also hangs, but in 'iowait'.  In particularly bad cases, any
	access to the disk will cause the process that made the
	access to hang in an 'iowait' state.

	In all cases, the hung process in 'vget' and any processes in
	'iowait' are unkillable by any means and a reboot is required.

	Antisocial behavior, to say the least....
>How-To-Repeat:
	Well, this might be hard to repeat, as I don't know exactly
	which part of the puzzle is broken.

>Fix:
	Unknown, but looking through /sys/kern/vfs_subr.c in the
	function 'vget' I see the following bit:

	/*
	 * If the vnode is in the process of being cleaned out for
	 * another use, we wait for the cleaning to finish and then
	 * return failure. Cleaning is determined either by checking
	 * that the VXLOCK flag is set, or that the use count is
	 * zero with the back pointer set to show that it has been
	 * removed from the free list by getnewvnode. The VXLOCK
	 * flag may not have been set yet because vclean is blocked in
	 * the VOP_LOCK call waiting for the VOP_INACTIVE to complete.
	 */
a.	if ((vp->v_flag & VXLOCK) ||
b.	    (vp->v_usecount == 0 &&
c.	     vp->v_freelist.tqe_prev == (struct vnode **)0xdeadb)) {
d.		vp->v_flag |= VXWANT;
e.		tsleep((caddr_t)vp, PINOD, "vget", 0);
f.		return (1);
g.	}

	The following bit is from the function 'vclean'...

h.	/*
i.	 * Done with purge, notify sleepers of the grim news.
j.	 */
k.	vp->v_op = dead_vnodeop_p;
l.	vp->v_tag = VT_NON;
m.	vp->v_flag &= ~VXLOCK;
n.	if (vp->v_flag & VXWANT) {
o.		vp->v_flag &= ~VXWANT;
p.		wakeup((caddr_t)vp);
q.	}

	I find these two bits interesting.  Is it at all possible
	that a race between them is happening?  In particular, the
	'if' at a, b, and c and the set at d racing with the 'if' at
	n.  Could the first 'if' execute, but not yet the set of
	VXWANT; then the second 'if' executes and finds VXWANT clear;
	then the set of VXWANT and the 'tsleep' execute?  Since the
	second 'if' has already checked the VXWANT flag, the 'wakeup'
	is never executed and the 'tsleep' never returns.
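
	To make the suspected ordering concrete, here is a small
	user-space model of it.  This is only a sketch, not kernel
	code: 'struct model_vnode', the 'asleep' field, the flag
	values and the function names are made up for illustration,
	'tsleep' and 'wakeup' are reduced to setting and clearing a
	boolean, and only the VXLOCK half of the test at a-c is
	modeled.  It also simply assumes that the code at m-q is able
	to run between the test at a-c and the set at d, which I
	have not shown to be possible in the real kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; not the kernel's definitions. */
#define VXLOCK 0x1
#define VXWANT 0x2

/* Hypothetical stand-in for the parts of struct vnode that matter here. */
struct model_vnode {
	int  v_flag;
	bool asleep;	/* true while "vget" is parked in its tsleep */
};

/* Lines a-c, reduced to the VXLOCK test only. */
static bool
vget_test(const struct model_vnode *vp)
{
	return (vp->v_flag & VXLOCK) != 0;
}

/* Lines d-e: record that someone is waiting, then sleep. */
static void
vget_set_and_sleep(struct model_vnode *vp)
{
	vp->v_flag |= VXWANT;
	vp->asleep = true;		/* models tsleep() */
}

/* Lines m-q: drop VXLOCK and wake a sleeper only if VXWANT is set. */
static void
vclean_finish(struct model_vnode *vp)
{
	vp->v_flag &= ~VXLOCK;
	if (vp->v_flag & VXWANT) {
		vp->v_flag &= ~VXWANT;
		vp->asleep = false;	/* models wakeup() */
	}
}

/*
 * Run one ordering of the two code paths.  Returns true if "vget"
 * is left asleep after "vclean" has finished, i.e. nobody is left
 * to wake it.
 */
bool
vget_stuck(bool vclean_runs_between_test_and_set)
{
	struct model_vnode vn = { VXLOCK, false };

	/* The vnode starts out being cleaned, so the test at a-c passes. */
	assert(vget_test(&vn));

	if (vclean_runs_between_test_and_set) {
		vclean_finish(&vn);		/* n sees VXWANT clear */
		vget_set_and_sleep(&vn);	/* d and e run too late */
	} else {
		vget_set_and_sleep(&vn);	/* d and e run first */
		vclean_finish(&vn);		/* n sees VXWANT, p wakes */
	}
	return vn.asleep;
}
```

	In this model vget_stuck(true) leaves the sleeper parked with
	VXWANT set and nothing left to call 'wakeup', while
	vget_stuck(false) ends with the sleeper awake.  Whether the
	first ordering can actually occur is exactly the open
	question.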

	This is probably all bunk......

>Audit-Trail:
>Unformatted: