tech-userlevel: Re: bin/10775: cron exits on stat failure

Subject: Re: bin/10775: cron exits on stat failure
To: None <gnats-bugs@gnats.netbsd.org, tech-userlevel@netbsd.org>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-userlevel
Date: 08/09/2000 17:20:55
    Date:        Tue,  8 Aug 2000 16:08:07 -0400 (EDT)
    From:        woods@weird.com (Greg A. Woods)
    Message-ID:  <20000808200807.32B5587@proven.weird.com>

  | Which exact releases (or kernel dates) are you folks seeing these cron
  | failures on?

The system I'm seeing this on currently is 1.4.1 - but I have seen it
happen on others (not 1.4x or 1.5*, but that's probably because I have
none of those in any kind of production use).   I may try creating some
fake load on a 1.5B system (or upgrade it to E or whatever is current)
and see if I can cause it to happen - but as the trigger conditions are
a mystery, that is not going to prove much.

  | Is there anything "special" about the filesystems on these
  | systems, such as any non-ffs filesystems,

The systems will all have NFS mounted filesystems, but not relevant to
cron in any way (ie: /var/cron is on /var which is ffs, as is root).

  | or the comings and goings of
  | other related processes, such as syslog, etc.?

Nothing else important mysteriously vanishes (that I have ever noticed
anyway).

  | Are you running NTP on
  | these machines, and are their clocks running smoothly -- i.e. are there
  | any time steps during normal operation?

I run NTP everywhere, without problems (so, no, the time isn't jumping
around, not that I think that would be relevant anyway).

  | I ask because I've never yet seen any unexplained failure of cron on any
  | kind of NetBSD system I've ever run

You don't seem to be unique in that.  I have quite a few NetBSD systems
around here, but only one of them is experiencing this problem at the minute.
For all I know it might depend upon the inode number of /var/cron/tabs
or something weird like that...

  | and I do have some that are
  | running at least one job every minute.

It may just be that the systems which notice this have a reasonably heavy
cron workload and also a reasonably heavy regular workload (the system I'm
getting this on at the minute is the departmental web (apache) server,
which also runs a caching named, and a fair amount of mrtg, which is what
a lot of the cron jobs are).

  | Would it be of any help to try and find a job set that would make this
  | failure easy to reproduce, but which could run without any special
  | requirements on any arbitrary machine?

Yes, it would help to be able to reproduce it.   I really don't think
that anything that cron is doing specifically has much to do with this
(though it may be creating the environment to trigger the problem).

  | For example can even something
  | as simple as just running several jobs calling /bin/true trigger the bug?

Don't know...

  | If so then I for one would be willing to install such a job set on a
  | number of my machines and try to reproduce the problem, perhaps even
  | while running cron under ktrace and/or gdb.

I've done the ktrace thing in the past - it showed nothing abnormal at all.
Well, it certainly showed the error being passed back from the kernel to
cron, but everything else looked just like it should.   I added a whole
bunch of diagnostics to cron to see if perhaps something cron was doing
might have been causing this (perhaps it does a chdir(), or a chroot(), or
something like that ... nothing found at all).   I can't see this being a
problem in cron, so I doubt that running it under gdb would help (there's
nothing interesting to see inside cron I don't think).

I'm 97% sure this is some kind of odd kernel problem, though why it affects
just cron (seemingly) I don't know - just maybe the habit cron has of
stat()ing the same directory over and over again, in combination with
something else happening, causes the problem (maybe some namei cache
weirdness or something like that).

The next time cron needs to restart on my web server, it will start a modified
version which logs the value of errno, and then immediately attempts the
stat() again, and if that works, just continues (otherwise still exits).
Unfortunately (!!) it hasn't died again in the past couple of days - and
killing cron just to get the new one running is not what I want to do, so
I need to wait for two more occurrences of this to get any more info.
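
To make the shape of that change concrete, here is a minimal sketch of the
"log errno, retry once, otherwise still fail" logic. It is not the actual
patch: the function name stat_log_and_retry is made up for illustration, and
it writes to stderr, where the real cron would presumably use its own
logging routine.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not taken from the cron sources): stat() the
 * given path; if that fails, log errno and try exactly once more.
 * Returns 0 if either attempt succeeds, -1 if both fail, in which
 * case errno holds the error from the second attempt.
 */
static int
stat_log_and_retry(const char *path, struct stat *sb)
{
	int saved_errno;

	assert(path != NULL && sb != NULL);

	if (stat(path, sb) == 0)
		return 0;

	/* Save errno before any library call has a chance to change it. */
	saved_errno = errno;
	fprintf(stderr, "stat(%s) failed: %s (errno %d); retrying\n",
	    path, strerror(saved_errno), saved_errno);

	if (stat(path, sb) == 0) {
		fprintf(stderr, "stat(%s) succeeded on retry\n", path);
		return 0;
	}

	saved_errno = errno;
	fprintf(stderr, "stat(%s) failed again: %s (errno %d)\n",
	    path, strerror(saved_errno), saved_errno);
	errno = saved_errno;	/* hand the second error back to the caller */
	return -1;
}
```

A caller would treat a return of -1 as the existing fatal case and exit as
before; a return of 0 after a logged first failure is the interesting
outcome, since it would suggest the error is transient.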

I understand the desire to make cron ignore this error and carry on anyway,
as cron exiting really is not nice - but this error really should be a fatal
error, and cron should exit if it ever happens.   What needs to be found
here is why the error is happening - not just to implement a workaround so
the error can be ignored.

If I can create an environment that will force this to happen (that is,
at least allow it to happen sometimes), then I can test it on a system I
can do real kernel debug work on and perhaps find the cause.  If I can't,
it will never get to more than guesswork (from me anyway).
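
One possible starting point for such an environment is a loop that stats
the same directory repeatedly, as cron does, and reports every failure. This
is only a sketch under the assumption that repeated stat() calls are part of
the trigger, which is a guess; the name count_stat_failures is invented
here, and nothing says this alone would reproduce the problem.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical stress helper: call stat() on the same path a given
 * number of times, log the errno of every failure to stderr, and
 * return how many of the calls failed.
 */
static unsigned long
count_stat_failures(const char *path, unsigned long iterations)
{
	struct stat sb;
	unsigned long failures = 0;
	unsigned long i;

	assert(path != NULL);

	for (i = 0; i < iterations; i++) {
		if (stat(path, &sb) == -1) {
			int saved_errno = errno;	/* keep it safe from fprintf */

			failures++;
			fprintf(stderr, "iteration %lu: stat(%s): %s (errno %d)\n",
			    i, path, strerror(saved_errno), saved_errno);
		}
	}
	return failures;
}
```

Run against /var/cron/tabs on a loaded machine, any non-zero result for a
directory that exists throughout would be the anomaly of interest.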

kre