Subject: Re: bin/10775: cron exits on stat failure
To: Robert Elz <kre@munnari.OZ.AU>
From: Bill Studenmund <wrstuden@zembu.com>
List: tech-userlevel
Date: 08/30/2000 15:00:24
On Thu, 31 Aug 2000, Robert Elz wrote:

>     Date:        Wed, 30 Aug 2000 11:41:42 -0700 (PDT)
>     From:        Bill Studenmund <wrstuden@zembu.com>
>     Message-ID:  <Pine.NEB.4.21.0008301135300.633-100000@candlekeep.home-net.internetconnect.net>
> 
>   | There was a deficiency in how the name cache interface worked. Sometimes
>   | it would cause panics when looking up files on an NFS file system (which
>   | Bill Sommerfeld tracked down). It might also cause this problem.
> 
> The namei cache was where I was thinking the problem probably existed,
> and I'd been looking at that code in the past couple of days to see if I
> could see anything - though I had been looking at the code from -current
> rather than 1.4.1.
> 
> Other than random unrelated NFS activity, which I guess could be a trigger,
> there's no NFS involved in the actual lookups that fail here.

The problem's not so much NFS activity as how the name cache worked. NFS
certainly would trigger the problem, with catastrophic effects.

The problem was that the name cache returned its vnode unlocked and
unreferenced. The file system then did some checks to determine whether
this was the node it wanted, and then proceeded - and those checks
happened with the node still unreferenced and unlocked. If the node were
at the head of the free list and the checking process had to sleep,
another process could come along, grab that vnode off the free list, and
reuse it. Thus the lookup code would have the node recycled out from
under it.
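
To make the window concrete, here is roughly what the old 4.4BSD-style
cache-hit path in ufs_lookup looked like. This is a schematic paraphrase,
not the literal 1.4.1 source, and the exact vget() arguments differ:

  /*
   * Schematic old-style cache hit: *vpp comes back unlocked and
   * unreferenced, and only the saved v_id guards against reuse.
   */
  if (cache_lookup(vdp, vpp, cnp) == -1) {    /* cache hit */
      vdp = *vpp;
      vpid = vdp->v_id;        /* remember the capability number */
      /*
       * vget() may have to sleep.  While we sleep the vnode is
       * still sitting on the free list, so another process can
       * grab it and recycle it for a completely different file.
       */
      error = vget(vdp, LK_EXCLUSIVE);
      if (error == 0) {
          if (vpid == vdp->v_id)
              return (0);      /* still ours - use the cached vnode */
          vput(vdp);           /* recycled under us - fall through */
      }                        /* and redo the lookup the slow way */
  }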

> I'd like to actually find a definitive cause for this, and know that it
> has been fixed, rather than that the problem "just went away".  That is,
> since the list of failures which I posted here almost a week ago, which
> were happening about twice a day, there hasn't been another one since -
> almost 6 days now.  Absolutely nothing I am aware of changed (no crontab file
> has changed, it is the same cron running the same kernel - not even a reboot.)
> Making some random change and just waiting for the problem to not occur
> again will never prove anything.   Another system here where cron used to
> fail even more frequently (when I was initially looking into this - like it
> would typically stop after 10-20 minutes) now never sees the problem at all.
> Nothing changed there that I am aware of either, between when the problem
> was there, and when it went away.  Obviously something changed ... but I
> have no idea what (it wasn't a kernel upgrade).   It has since been upgraded,
> it was 1.3.3 when the problems occurred, and is now 1.4.2 - but the cron
> problem had vanished much earlier.
> 
> My current wild guess is that the problem relates in some way to the
> hashing scheme, and might depend upon the inode number (or perhaps v_id)
> that happens to be assigned to the /var/cron/tabs directory, and then some
> other lookup, I assume also for "tabs" is failing some other place and
> causing a bogus negative cache entry just briefly.   Or something like that.

That's kinda it, as I understand the problem. Basically the v_id gets
changed while the lookup routine is validating the returned vnode, and
the lookup gets quite confused.
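
To be concrete about where the v_id change comes from: in the
traditional code, cache_purge() hands the vnode a fresh capability
number when it gets reclaimed for reuse. Schematically (a paraphrase,
not the exact 1.4.1 source):

  /*
   * Schematic: cache_purge() runs when a vnode is reclaimed for
   * reuse.  Bumping v_id is what invalidates any capability number
   * a sleeping lookup saved earlier.
   */
  void
  cache_purge(struct vnode *vp)
  {
      /* ... throw away vp's name cache entries ... */
      vp->v_id = ++nextvnodeid;    /* invalidate every saved vpid */
  }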

> What I want to do is find that trigger - work out exactly how to make
> the problem occur by design, and make that happen.   Then fix the bug
> knowing I have a test case to demonstrate that it really is fixed.

That might be a challenge. But if it's what I think, having a low number
of vnodes will help - vnodes get recycled that much sooner, so the
free-list race is easier to hit. Good luck!

> That is unless someone can say definitively "yes, I know exactly what
> that bug was, and the change in version n.m of xxx.c fixed it", in that
> case I will just get the fix and apply it...

Bill can say more about this. The test would be to see if you could pull
in the new namei cache code. But that probably won't apply cleanly.

> Otherwise, if anyone has a suggestion as to what to try as that test case,
> I'm quite willing to do that - I don't mind how often I kill (or at least,
> have cron detect the failure and continue these days) cron on this system,
> and I have no problem building new kernels and running those (even ones
> with kernel diagnostics, perhaps printfs, installed).

The only thing I can think of would be to somehow mark a few of the nodes
(/var, /var/cron, etc.) with a test flag, and have ufs_lookup sleep for a
little while when it finds them in the name cache. I can't really think
of a better way to test vnode free list dynamics.
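
Something along these lines is what I have in mind - totally untested,
and VTEST_RACE is a made-up flag you would have to add and set on the
interesting vnodes yourself:

  /*
   * Rough sketch only.  VTEST_RACE is an invented v_flag bit; you
   * would have to define it and set it on the vnodes for /var,
   * /var/cron and /var/cron/tabs by some hack of your own.  The
   * tsleep() just widens the window between the cache hit and the
   * v_id re-check.
   */
  if (cache_lookup(vdp, vpp, cnp) == -1) {    /* cache hit */
      vdp = *vpp;
      vpid = vdp->v_id;
      if (vdp->v_flag & VTEST_RACE)           /* invented flag */
          (void) tsleep(&vpid, PZERO, "nctest", hz);  /* ~1 second */
      error = vget(vdp, LK_EXCLUSIVE);
      /* ... then the usual vpid == vdp->v_id check as before ... */
  }

Then run something that churns vnodes hard - a find(1) over a big tree,
say - and see whether cron's stat() starts failing on cue.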

Take care,

Bill