Subject: Re: Getting information out of crash dump
To: Bill Studenmund <wrstuden@nas.nasa.gov>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 04/26/1999 21:19:32
Bill Studenmund writes:
> On Mon, 26 Apr 1999, Chuck Silvers wrote:
> 
> > if this is the same "vrele: ref cnt" panic that was PR'd yesterday,
> > I'm looking at it too.
> 
> I've been trying to track a similar problem down for the last week or so,
> without success. Something somewhere is calling vrele on a vnode it didn't
> vget. I've seen it in the context of vnd disks dying (VOP_BMAP gets back a
> vnode with usecount == 0).
> 
> > in this dump, the rest of the stack trace is:
> > frame ptr	pc		function
> > 0xf3ef1e10	0xf0100f1d	calltrap
> > 0xf3ef1e24	0xf018ba69	panic
> > 0xf3ef1e50	0xf01a36c0	vrele+80
> > 0xf3ef1f24	0xf01a8497	rename_files+959
> > 0xf3ef1f3c	0xf01a80bd	sys_rename+21
> > 0xf3ef1fa8	0xf027ed26	syscall+526
> > 0xefbfdd64	0xf0100fc9	syscall1+31
> > 
> > this corresponds to the vrele() in this bit at the end of
> > kern/vfs_syscalls.c:rename_files()
> > 
> > out1:
> > 	if (fromnd.ni_startdir)
> > 		vrele(fromnd.ni_startdir);
> > 	FREE(fromnd.ni_cnd.cn_pnbuf, M_NAMEI);
> > 	return (error == -1 ? 0 : error);
> 
> While I think this code is weird, I think it's ok. It's been that way for
> 5 years and is that way in FreeBSD, so I doubt that's the problem. I think
> the problem is that somewhere else has vrele'd when it shouldn't, and this
> is where it got caught.

yeah, I have no idea where the actual bug is, I was just pointing out
where it crashed.  further, the problem doesn't seem to be easily repeatable.
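
for reference, the check that's tripping is the DIAGNOSTIC one in vrele().
this is paraphrased from memory of the 4.4BSD-derived code in
kern/vfs_subr.c, so the details may be a little off:

	void
	vrele(vp)
		struct vnode *vp;
	{
		/* ... */
		vp->v_usecount--;
		if (vp->v_usecount > 0)
			return;
	#ifdef DIAGNOSTIC
		if (vp->v_usecount != 0 || vp->v_writecount != 0) {
			vprint("vrele: bad ref count", vp);
			panic("vrele: ref cnt");	/* the panic we're seeing */
		}
	#endif
		/* ... put the vnode back on the free list ... */
	}

which fits your theory below: somebody else already did one vrele() too
many, and the call in rename_files() is just the one that trips the check.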


> So what will be hanging onto vnode references?
> 
> I asked Charles about it, and he pointed out the name cache and the
> buffer cache. But the buffer cache uses hold counts (not the usecount
> that hit zero here), and the name cache doesn't mess with the usecount
> on the vnodes it caches.
> 
> I'm asking as I've seen something really bumping the count, and I suspect
> whatever it is is getting things wrong.
> 
> 
> Here's what I saw:
> 
> I changed vndstrategy() so that it would vprint the vnode it gets back
> from VOP_BMAP. I had a filesystem on /dev/sd0d mounted at /TEST, and the
> vnd is configured as /dev/vnd0d on /TEST/bill/foo. VOP_BMAP is designed
> to map the file's blocks (the blocks of the vn device) onto blocks of
> the underlying device (/dev/sd0d).
> 
> I expected that, as I had one fs mounted, the usecount would be around 1
> or 2. In some runs, I got 4, then 5, then 6. In other runs, I got numbers
> around 50, and in others, around 200. This is 200 users of /dev/sd0d!
> 
> Does anyone know what other caching would be doing this?
> 
> Take care,
> 
> Bill
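
to spell out the distinction above: the buffer cache pins a vnode with
vhold()/holdrele() (or the VHOLD/HOLDRELE macros), which only touch
v_holdcnt; the count that panicked, v_usecount, only moves through
vget()/vref() and vrele()/vput().  roughly (again from memory, so the
exact signatures may be off):

	VREF(vp);		/* v_usecount++, vnode already referenced */
	vget(vp, ...);		/* v_usecount++, may take vp off the free list */
	vrele(vp);		/* v_usecount--, panics if the count goes bad */
	vput(vp);		/* VOP_UNLOCK() + vrele() */

	vhold(vp);		/* v_holdcnt++, what the buffer cache uses */
	holdrele(vp);		/* v_holdcnt-- */

so whatever is pushing /dev/sd0d's vnode up to 200 is going through
vref/vget, not the buffer cache.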


one way to find this would be to put in this bit where you have the vprint():

	if (vp->v_usecount > 100) {
		Debugger();
	}

then do a backtrace and see where the refs are coming from, then "c" to
continue and get a few more backtraces to make sure they're all the same.

-Chuck