Subject: daily crashes with 1.6.1
To: None <email@example.com>
From: Tim Middleton <x@Vex.Net>
Date: 07/04/2003 11:20:18
Until a few weeks ago a box of ours running 1.5.3 was very stable. Since
upgrading to 1.6.1 it crashes several times a day. We have 1.6.1 on some boxes
which are stable, and have not been able to determine the cause of
the instability on this one particular box. The main difference between
this box and the others is that it is an NFS server. One of the difficulties
we have in determining the exact nature of the crash is that the box is
administered remotely for the most part, and so no one is there when it
locks up. There seems to be no core dump. We reboot the machine remotely via
APC. We need the machine to be up, to serve files, so we have not been able
to take it offline and examine it more closely. We have taken to manually
rebooting the machine several times a day, so at least it reboots cleanly.
Several times the master.passwd file has been corrupt after rebooting, and had
to be restored... interestingly it has been corrupt in the exact same way
each time... overwritten by a chunk of our named.conf. Both the password file
and named.conf are rewritten frequently by a cron script on this
system... so it would seem that whatever the problem is, it may be
triggered by file writing. (We were paranoid that something had gone wrong
with our scripts, causing them to overwrite the master.passwd file somehow...
but overwriting the master.passwd file would not cause a box to lock up to the
point of not responding at all to pings, would it? Also, we disabled those
cron scripts, and the box still eventually locked up... though at least the
password files were not corrupt in these cases).
Also we're not sure how this could be related to our current prime suspect,
NFS, as the password files are not on an NFS-related partition.
Despite the severity of the current situation, we are reluctant to revert
to 1.5.3, due to other fixes and features now in 1.6.1.
We have actually been tracking and running -current for about a week now, in hopes
that some new work might fix the problem (and not cause more problems), but
so far our situation has not changed.
Does anyone have any suggestions regarding how we might best approach
determining what exactly the problem is, so that we can hopefully fix it,
or provide information back to the NetBSD project which may be useful?
Tim Middleton | Cain Gang Ltd | A man is rich in proportion to the number of
x@veX.net | www.Vex.Net | things which he can afford to let alone. HDT