Subject: Re: bin/10775: cron exits on stat failure
To: Robert Elz <kre@munnari.OZ.AU>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 08/08/2000 16:08:07
[ On Tuesday, August 8, 2000 at 11:16:31 (+1000), Robert Elz wrote: ]
> Subject: Re: bin/10775: cron exits on stat failure 
>
> The one other thing I ought to add, is that on the systems I see this,
> cron is doing a lot of work - that is, root's crontab file would have
> something to execute just about every minute (on the system currently 
> experiencing the problem there are no other user crontab files, on the
> one where it once occurred much more often there would have been a few
> others as well, none of those requesting much work though).

Which exact releases (or kernel dates) are you folks seeing these cron
failures on?  Is there anything "special" about the filesystems on these
systems, such as any non-ffs filesystems, or the comings and goings of
other related processes, such as syslog, etc.?  Are you running NTP on
these machines, and are their clocks running smoothly -- i.e. are there
any time steps during normal operation?

I ask because I've never yet seen any unexplained failure of cron on any
kind of NetBSD system I've ever run (or on any other system running the
same basic version of cron, for that matter) and I do have some that are
running at least one job every minute.  I've also got several mixes of
configurations, some where I've used /etc/crontab to replace root's
crontab, and others where I have not.  I wouldn't say that I've got any
really heavy usage though....  Of the machines running jobs every minute
there's only one such job and it runs from an unprivileged user's
crontab, though on the same machine there's also a five-minute job and
two ten-minute jobs, plus a couple of daily jobs.

I should point out that on all but my aging 1.3.2 sparc I've made some
changes to cron to improve its logging significantly (and to fix some
documentation, and some other things that were bugging me).  I should
probably send-pr these changes at some point, but I don't see anything
obvious that would affect its reliability.  I did add a shutdown handler
for SIGINT and SIGTERM so I could close the log properly as well as log
these events; and I applied patches from FreeBSD PR#5572.  Let me know
if you want to see these changes before I get around to PRing them.

Would it be of any help to try and find a job set that would make this
failure easy to reproduce, but which could run without any special
requirements on any arbitrary machine?  For example, can even something
as simple as just running several jobs calling /bin/true trigger the bug?

If so then I for one would be willing to install such a job set on a
number of my machines and try to reproduce the problem, perhaps even
while running cron under ktrace and/or gdb.
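
For illustration only, a job set of that kind could be generated along
these lines.  This is a minimal sketch, not something that has been
shown to trigger the failure: the function name make_job_set, the
default of five jobs, and the every-minute schedule are all assumptions
chosen for the example.

```python
import sys


def make_job_set(n_jobs=5, command="/bin/true"):
    """Return crontab text with n_jobs entries that each run command every minute.

    Hypothetical helper: the job count and schedule are illustrative guesses,
    not values known to reproduce the cron failure.
    """
    lines = ["# hypothetical test job set: every entry fires every minute"]
    for _ in range(n_jobs):
        # Five time fields (minute, hour, day of month, month, day of week),
        # all wildcards, followed by the command to run.
        lines.append("* * * * * " + command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the job set to stdout so it can be inspected before installing.
    sys.stdout.write(make_job_set())
```

The output is in user-crontab format (no user field), so it would not
suit /etc/crontab as-is.  Note that installing it with crontab(1)
replaces that user's existing crontab, so a scratch account is the safer
place to try it.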

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>