Subject: Re: bin/10775: cron exits on stat failure
To: Bill Studenmund <wrstuden@zembu.com>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-userlevel
Date: 08/31/2000 06:42:59
    Date:        Wed, 30 Aug 2000 11:41:42 -0700 (PDT)
    From:        Bill Studenmund <wrstuden@zembu.com>
    Message-ID:  <Pine.NEB.4.21.0008301135300.633-100000@candlekeep.home-net.internetconnect.net>

  | That could be it. Try 1.5.

Unfortunately, on this system that's not so easy: upgrading to 1.5
is a major task, and this is our web server host, so it really can't
just vanish for hours just to see whether a fix helps a problem which
isn't really affecting anything (while this system does a lot of stuff
from cron, none of it really matters to its purpose - cron could quite
possibly have been missing for a week without anyone noticing).

  | There was a deficiency in how the name cache interface worked. Sometimes
  | it would cause panics when looking up files on an NFS file system (which
  | Bill Sommerfeld tracked down). It might also cause this problem.

The namei cache was where I was thinking the problem probably existed,
and I'd been looking at that code in the past couple of days to see if I
could see anything - though I had been looking at the code from -current
rather than 1.4.1.

Other than random unrelated NFS activity, which I guess could be a trigger,
there's no NFS involved in the actual lookups that fail here.

I'd like to actually find a definitive cause for this, and know that it
has been fixed, rather than that the problem "just went away".  That is,
since the list of failures I posted here almost a week ago (which were
happening about twice a day) there hasn't been another one - almost
6 days now.  Absolutely nothing I am aware of changed (no crontab file
has changed, it is the same cron running the same kernel - not even a reboot).
Making some random change and just waiting for the problem to not occur
again will never prove anything.   Another system here where cron used to
fail even more frequently (when I was initially looking into this - like it
would typically stop after 10-20 minutes) now never sees the problem at all.
Nothing changed there that I am aware of either, between when the problem
was there, and when it went away.  Obviously something changed ... but I
have no idea what (it wasn't a kernel upgrade).   It has since been upgraded -
it was 1.3.3 when the problems occurred, and is now 1.4.2 - but the cron
problem had vanished much earlier.

My current wild guess is that the problem relates in some way to the
hashing scheme, and might depend upon the inode number (or perhaps v_id)
that happens to be assigned to the /var/cron/tabs directory, with some
other lookup - I assume also for "tabs" - failing some other place and
briefly causing a bogus negative cache entry.   Or something like that.
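
To make that guess a little more concrete, here's a tiny userland model
of the kind of thing I mean - nothing here is the real vfs_cache.c code,
the names and the deliberately incomplete comparison are mine, purely
for illustration - showing how a negative entry for "tabs" made under
one directory could briefly shadow the real "tabs" under another if the
cache key check were incomplete:

/*
 * Toy model (hypothetical): a name cache keyed by hash of
 * (directory id, name), with a buggy lookup that compares only the
 * name.  A failing lookup of "tabs" somewhere else then leaves a
 * negative entry that shadows the real /var/cron/tabs.
 */
#include <stdio.h>
#include <string.h>

#define NHASH	8

struct ncentry {
	int	valid;		/* slot in use */
	long	dir_id;		/* id of the directory looked up in */
	char	name[32];	/* the path component */
	long	target_id;	/* id of the result, or -1 for "no such name" */
};

static struct ncentry nchash[NHASH];

static unsigned
hash(long dir_id, const char *name)
{
	unsigned h = (unsigned)dir_id;

	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h % NHASH;
}

static void
cache_enter(long dir_id, const char *name, long target_id)
{
	struct ncentry *ncp = &nchash[hash(dir_id, name)];

	ncp->valid = 1;
	ncp->dir_id = dir_id;
	strncpy(ncp->name, name, sizeof(ncp->name) - 1);
	ncp->target_id = target_id;	/* -1 records a negative entry */
}

/*
 * The hypothetical bug: trust the hash bucket and check only the name,
 * not the directory id, so entries from directories whose hashes
 * collide get confused with each other.
 */
static int
cache_lookup(long dir_id, const char *name, long *target_id)
{
	struct ncentry *ncp = &nchash[hash(dir_id, name)];

	if (!ncp->valid || strcmp(ncp->name, name) != 0)
		return 0;		/* miss */
	*target_id = ncp->target_id;
	return 1;			/* hit (possibly negative) */
}

int
main(void)
{
	long result;

	/* /var/cron/tabs exists: a positive entry under directory 100. */
	cache_enter(100, "tabs", 12345);

	/* An unrelated lookup of "tabs" under directory 108 fails and
	 * leaves a negative entry; 108 hashes to the same bucket as 100. */
	cache_enter(108, "tabs", -1);

	/* cron's next stat() of its tabs directory now hits the bogus
	 * negative entry and sees ENOENT even though the file is there. */
	if (cache_lookup(100, "tabs", &result) && result == -1)
		printf("spurious ENOENT for tabs under directory 100\n");
	return 0;
}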

What I want to do is find that trigger - work out exactly how to make
the problem occur by design, and make that happen.   Then fix the bug
knowing I have a test case to demonstrate that it really is fixed.

That is unless someone can say definitively "yes, I know exactly what
that bug was, and the change in version n.m of xxx.c fixed it", in which
case I will just get the fix and apply it...

Otherwise, if anyone has a suggestion as to what to try as that test case,
I'm quite willing to try it - I don't mind how often I kill cron on this
system (or at least, these days, have it detect the failure and continue),
and I have no problem building new kernels and running those (even ones
with kernel diagnostics, perhaps printfs, installed).
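
For what it's worth, the detection half of such a test is trivial -
something along the lines of the (hypothetical) loop below, which just
does the same stat() cron does and logs any failure; the missing piece
is whatever it takes to make that stat() fail in the first place:

/*
 * Hypothetical harness: hammer on the same stat() cron does on its
 * spool directory and log any failure, so that when the spurious
 * ENOENT shows up there's a timestamp and errno to go with it.
 */
#include <sys/stat.h>

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
	struct stat sb;
	time_t now;

	for (;;) {
		if (stat("/var/cron/tabs", &sb) == -1) {
			time(&now);
			printf("%.24s: stat failed: %s\n",
			    ctime(&now), strerror(errno));
			fflush(stdout);
		}
		sleep(1);
	}
	/* NOTREACHED */
}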

I will also take the hint and look at the differences between the name cache
system in 1.4.1 and 1.5 and see if that gives me any clues for forward
progress.   In the past 6 days I have been beating heavily on a 1.5*
system to see if I can make this happen, even once ... so far without
luck at all (it has been doing lots of cron stuff, lots of general
filesystem stuff, keeping very busy).   That may be because the problem
is fixed, or it could just be that some essential element in the trigger
conditions for this bug isn't met there (perhaps the inode number of
/var/cron/tabs isn't a susceptible one, or something - actually, that's
an idea, I should start beating on that directory, making changes, and see
if that ever has any effect).   Or it could be of course that the bug really
is fixed in 1.5...

kre

ps: while looking over the -current vfs_cache.c code I noticed a bunch
of cosmetic things that ought to be fixed sometime - like it probably
should be using TAILQ_NEXT() in several places instead of "knowing" how
that works.   Similarly for TAILQ_FIRST().   Someone should probably make
some random meaningless name changes to the various queue structure
fields (the stuff that's supposed to be manipulated via macros) and then go
fix everywhere in the kernel that fails to compile because of those
changes.
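
For instance (just an illustration of the style I mean, with made-up
field names - not a patch against vfs_cache.c), traversal of an LRU
chain ought to look something like this:

#include <sys/queue.h>

#include <stdio.h>
#include <stdlib.h>

struct namecache {
	TAILQ_ENTRY(namecache) nc_lru;	/* LRU chain linkage */
	int nc_id;
};

TAILQ_HEAD(nclist, namecache);

int
main(void)
{
	struct nclist nclruhead = TAILQ_HEAD_INITIALIZER(nclruhead);
	struct namecache *ncp;
	int i;

	for (i = 0; i < 3; i++) {
		ncp = malloc(sizeof(*ncp));
		ncp->nc_id = i;
		TAILQ_INSERT_TAIL(&nclruhead, ncp, nc_lru);
	}

	/*
	 * Walk the chain via the macros, rather than "knowing" that the
	 * head and forward pointers are called tqh_first and tqe_next.
	 */
	for (ncp = TAILQ_FIRST(&nclruhead); ncp != NULL;
	    ncp = TAILQ_NEXT(ncp, nc_lru))
		printf("entry %d\n", ncp->nc_id);
	return 0;
}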

There's also this comment:

 * The line "nxtcp = nchhead" near the end is to avoid potential problems
 * if the cache lru chain is modified while we are dumping the
 * inode.  This makes the algorithm O(n^2), but do you think I care?

ahead of cache_purgevfs(), which accurately describes the code -- well,
at least accurately describes what the code is doing, but the rationale
seems to have vanished.  I suspect that comment was one of mine (it
has my style, and it refers to inode, not vnode), way back in the early
days of the name cache code, when this particular bit of code was intertwined
with code which did other stuff (like a vput() or vrele() or something
- though at the time it would have been iput()) which could sleep.
I don't see anything like that there any more, so the "return to the head
of the list" stuff looks like meaningless baggage now...