Subject: daily crashes with 1.6.1
To: None <>
From: Tim Middleton <x@Vex.Net>
List: current-users
Date: 07/04/2003 11:20:18

Until a few weeks ago a box of ours running 1.5.3 was very stable. Since
upgrading to 1.6.1 it crashes several times a day. We have 1.6.1 on some other
boxes which are stable, and have not been able to determine the cause of
the instability on this one particular box. The main difference between
this box and the others is that it is an NFS server. One of the difficulties
we have in determining the exact nature of the crash is that the box is
administered remotely for the most part, so no one is there when it
locks up. There seems to be no core dump. We reboot the machine remotely via
APC. We need the machine to be up to serve files, so we have not been able
to take it offline and examine it more closely. We have taken to manually
rebooting the machine several times a day, so that at least it reboots cleanly.

Several times the master.passwd file has been corrupted after rebooting, and had
to be restored... interestingly, it has been corrupted in exactly the same way
each time: overwritten by a chunk of our named.conf. Both the password file
and named.conf are rewritten frequently by a cron script on this
system... so it would seem to indicate that whatever the problem is, it may be
triggered by file writes. (We were paranoid that something had gone wrong
with our scripts, causing them to overwrite the master.passwd file somehow...
but overwriting the master.passwd file would not cause a box to lock up to the
point of not responding to pings at all, would it? We also disabled those
cron scripts, and the box still eventually locked up... though at least the
password files were not corrupted in those cases.)
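
For anyone who wants to rule out the rewrite step itself, here is a minimal
sketch of an atomic file replacement. It is not our actual cron script; the
function name atomic_write and the Python implementation are assumptions made
purely for illustration. The idea is to write the new contents to a temporary
file in the same directory, flush them to disk, and only then rename over the
old file, so that a crash should leave either the old or the new version in
place rather than a partial mix.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) via a temp file and a rename.

    Hypothetical helper for illustration only. Note that the new file is
    created with mode 0600 (mkstemp's default), not the old file's mode.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the contents to disk first
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself is recorded on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This would not explain or prevent a kernel lockup, but if corruption still
appeared with writes done this way, that would point further away from the
scripts and toward the filesystem or kernel.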

Also, we're not sure how this could be related to our current prime suspect,
NFS, as the password files are not on an NFS-related partition.

Despite the severity of the current situation, we are reluctant to revert
to 1.5.3, due to other fixes and features now in 1.6.1.

We have actually been tracking and running -current for about a week now, in
the hope that some new work might fix the problem (and not cause more
problems), but so far our situation has not changed.

Does anyone have any suggestions regarding how we might best approach
determining what exactly the problem is, so that we can hopefully fix it,
or provide information back to the NetBSD project which may be useful?

Tim Middleton | Cain Gang Ltd | A man is rich in proportion to the number of
              | www.Vex.Net   | things which he can afford to let alone. HDT