Subject: Re: daily crashes with 1.6.1
To: NetBSD-current Discussion List <current-users@NetBSD.ORG>
From: Tim Middleton <x@Vex.Net>
List: current-users
Date: 07/04/2003 16:20:45
Greg A. Woods wrote:
> It is also different hardware.  :-)

Hi Greg.

D'Arcy reports he has the exact same hardware running 1.6.1 at Givex on 
several boxes, without issues... except one box. One of their boxes (same 
hardware) *is* having problems too. The common denominator *seems* to be that both 
these boxes are running NFS... but we have not been able to confirm this is 
the problem yet.

> Do your scripts also first write to /etc/ptmp (after carefully creating
> it with O_CREAT|O_EXCL) then run "pwd_mkdb -p /etc/ptmp" (i.e. in the

Of course not. You know us better than that. <-; Actually we don't write to 
/etc/ptmp on Vex, different filename... 

> (it might not be such a bad idea to add an flock() call for /etc/ptmp to

Yeah, that's a good idea. I've long meant to rewrite the scripts on Vex to 
work in a fundamentally different way which would avoid all the potential 
problems... unique temp files... store system logins in a separate file 
rather than the one being overwritten... etc.. but you know... haven't got 
there yet. 
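
A rough sketch of what that rewrite might look like, combining the
O_CREAT|O_EXCL lock file and the flock() call from Greg's message with a
unique temp file that is renamed into place. This is only an illustration
in Python, not the actual scripts on Vex: the function name
update_file_atomically and its parameters are hypothetical, and a real
password database update would still go through pwd_mkdb rather than a
bare rename.

```python
import fcntl
import os
import tempfile


def update_file_atomically(target, new_contents, lock_path):
    """Replace `target` with `new_contents` while holding a lock file.

    Hypothetical helper for illustration only.
    """
    # O_CREAT|O_EXCL makes creation fail (FileExistsError) if another
    # updater already holds the lock file.
    lock_fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        # Advisory lock on top of the exclusive create.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)

        # Unique temp file in the same directory as the target, so the
        # final rename stays on one filesystem and is atomic.
        # Note: mkstemp creates the file with mode 0600.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(target) or ".")
        try:
            with os.fdopen(fd, "w") as tmp:
                tmp.write(new_contents)
                tmp.flush()
                os.fsync(tmp.fileno())  # push the data to disk before renaming
            os.rename(tmp_path, target)
        except BaseException:
            # The rename did not happen, so the temp file is still there.
            os.unlink(tmp_path)
            raise
    finally:
        # Closing the descriptor releases the flock; then drop the lock file.
        os.close(lock_fd)
        os.unlink(lock_path)
```

The point of the design is that a reader of the target only ever sees the
complete old file or the complete new one, and a second updater fails
immediately on the lock file instead of scribbling over a half-written copy.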

I can't see how this problem has anything to do with scripting though. These 
scripts have run forever... they may have certain design problems... but any 
problems they've had, we're well familiar with.

> Were any of the auto-updated files corrupt in a crash after having
> disabled the cron jobs?

The only files that have been corrupted by the lock up that we have found are 
master.passwd (this corruption would happen only for about 50% of the 
crashes) and once I believe /etc/group was messed up.

> I would lean more towards it being a hardware problem....

What hardware would you suggest? Drive? Controller? It seems rather an 
incredible coincidence that this hardware would fail just when we upgraded to 
1.6.1 when it was all running fine with 1.5.3 for so long.

We were, however, having SCSI controller problems when we first upgraded. 1.5.3 
ran perfectly (no hardware changes), but 1.6.0 and 1.6.1 releases would not 
work with the onboard STL2 SCSI controller. We disabled it eventually in the 
BIOS and put in an Adaptec card. However current seems to have fixed the 
problems with the SCSI driver, and we've moved back to the on-board 
controller at the moment (trying basically anything to stop the crashing... 
we even have contemplated taking SCSI out of the equation by dropping in IDE 
drives. <-:) Personally, at this point, despite reluctance, I'm wanting to 
just go back to 1.5.3 which was stable for us. I'm quite sure these problems 
will disappear if we do (and if they don't, then I'd think it was 
hardware)... but others (not mentioning any names <-:) are against this for 
various reasons. So I'm fishing... 

Tim Middleton | Cain Gang Ltd | I felt very much alone, so I took another
              | www.Vex.Net   | ginger-snap. --Greene (TWMA)