Subject: Re: bin/10775: cron exits on stat failure
To: Kimmo Suominen <kim@tac.nyc.ny.us>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-userlevel
Date: 08/08/2000 10:21:45
    Date:        Mon, 7 Aug 2000 18:18:16 +0300 (EEST)
    From:        Kimmo Suominen <kim@tac.nyc.ny.us>
    Message-ID:  <200008071518.SAA00878@pyry.gw.com>

  |     In src/usr.sbin/cron/database.c you'll find that cron will exit if
  |     it fails to stat the spool directory (/var/cron/tabs).  I think it
  |     should sleep and try again, and also keep processing the system
  |     crontab file (/etc/crontab).

I thought I must be going insane: I saw that happen so frequently
here, and yet had heard no mention of it in the lists or the PR database
at all...

  |     I guess you could remove /var/cron/tabs to repeat this.  On one of
  |     my systems the stat fails occasionally (I don't know why -- resource
  |     starvation maybe?), and cron dies.  This can go unnoticed for a few
  |     days, easily.

Yes...

On one of my systems I have this in /etc/rc (it is a 1.4.1 system, which
predates the rc.d stuff) ...

if checkyesno cron; then 
        echo -n ' cron';                cron
        ( while :
        do
                sleep 77 
                if [ `ps axc | grep -i cron | wc -l` -lt 1 ]
                then
                        logger -p cron.notice -t CRON_RESTART "restarting cron"
                        cron
                fi
        done & )
fi
echo '.' 

I have had systems where cron used to die repeatedly and then, with nothing
changed at all, stopped dying (for now, anyway), and I have this other system
where it dies quite frequently.   Neither how busy the system is nor how long
it has been running seems to be connected.

I spent a bunch of time attempting to figure out what is going on, and got
nowhere at all.  On one system where this happened I had cron dumping all
kinds of status (errno values, the current working directory, the buffer
containing the name it was passing to stat()) and never saw anything at all
that could explain the problem.   That is, I convinced myself that this has
to be a kernel problem of some kind.

I suspect that simply repeating the stat() would work, with no sleep needed,
but I never got round to testing that theory (the loop & restart technique
has been working well enough for me for now...)
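A minimal sketch of the retry idea, assuming it would be factored into a small helper; the name stat_retry and its parameters are hypothetical and not part of cron's database.c.  With delay set to 0 it simply repeats the stat() (the untested theory above); a nonzero delay gives the sleep-and-retry behaviour suggested in the PR.

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of cron: stat() `path`, trying up to
 * `tries` times in total (tries should be at least 1).  `delay` is the
 * number of seconds to sleep between attempts; 0 retries immediately.
 * Returns 0 on success, or -1 with errno set by the last failed stat().
 */
int
stat_retry(const char *path, struct stat *sb, int tries, unsigned int delay)
{
	int i, saved_errno = 0;

	for (i = 0; i < tries; i++) {
		if (stat(path, sb) == 0)
			return 0;
		saved_errno = errno;	/* keep stat()'s errno; sleep() may clobber it */
		if (delay > 0 && i + 1 < tries)
			sleep(delay);
	}
	errno = saved_errno;
	return -1;
}
```

In the caller, a -1 would then be logged and the spool directory skipped for that pass instead of calling exit(), so /etc/crontab would keep being processed and the spool would be tried again on the next pass.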

kre