Subject: Re: daily crashes with 1.6.1
To: NetBSD-current Discussion List <current-users@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: current-users
Date: 07/04/2003 17:57:34
[ On Friday, July 4, 2003 at 16:20:45 (-0400), Tim Middleton wrote: ]
> Subject: Re: daily crashes with 1.6.1
>
> D'Arcy reports he has the exact same hardware running 1.6.1 at Givex on 
> several boxes, without issues... except one box. One of their boxes (same 
> hardware) *is* having problems too. The common denominator *seems* to be both 
> these boxes are running NFS... but we have not been able to confirm this is 
> the problem yet.

I'd start swapping systems (whole systems if possible, keeping just the
disks; otherwise memory, motherboards, cables, and controllers), and see
if the problem moves with some component(s) or not.

> I can't see how this problem has anything to do with scripting though.

I think you're right -- I would suspect they're merely victims because
they run regularly and can more easily get caught while running.

> The only files that have been corrupted by the lock up that we have found are 
> master.passwd (this corruption would happen only for about 50% of the 
> crashes) and once I believe /etc/group was messed up.

I'd guess the only thing interesting about the corrupted files would be
that they were open for write at the time of the crash.

> What hardware would you suggest? Drive? Controller? It seems rather an 
> incredible coincidence that this hardware would fail just when we upgraded to 
> 1.6.1 when it was all running fine with 1.5.3 for so long.

That's the difficult/fun part to figure out!  ;-)

If you've got identically configured machines where the only difference is
supposed to be the data on the disks then I'd try swapping the disks and
see if the problem moves (and swap any external cabling such as network
connections if necessary too of course).

It could even be the power supplies.  If those machines are like the
newer ones at ACI with the Intel STL2 motherboard then I think ACI's
machines have beefier 400W power supplies (IIRC the STL2 motherboard
alone can draw over 250W with fully loaded card slots, CPUs, and RAM).
Add any disks and you may be exceeding the PS specs, particularly the
+5VDC and +12VDC draw (most ATX power supplies have lots of juice for
the +3.3VDC), and some of ACI's machines have four Cheetahs internally.

> However current seems to have fixed the 
> problems with the scsi driver, and we've moved back to the on-board 
> controller at the moment (trying basically anything to stop the crashing... 

Well that will eliminate the additional controller as a potential
failure point, and of course also eliminate what little power it draws
too if you've pulled it as well.

> So I'm fishing... 

Well I guess I should ask the obvious:

Is this a real crash, a silent reboot, or a hard hang?

If it's a crash has anyone managed to capture any system dumps or kernel
stack backtraces from ddb?

If it's a hard hang can you get into ddb on the console?

If it's a silent reboot (i.e. the machine is working at one moment and
then suddenly without warning or record of any error on the console it's
counting its memory and getting ready to boot again), and if it still
does this with -current, then I'd think that points more towards a
hardware problem, though there were rumours on this list not so long ago
about bugs that could cause sporadic spontaneous reboots in some kinds
of i386 machines.

Here's one more fishing line for your expedition:

Are the "wake-on-LAN" features all completely disabled in the BIOS for
all the interfaces in use?  I expect you're using one or both of the
on-board FXP interfaces.  NFS could result in lots of additional traffic
on the interfaces and if there's any chance of a collision (i.e. if
they're not running full-duplex to buffered ethernet switches) then
maybe some random garbage signal may eventually look enough like a WOL
signal to cause a problem (though in theory WOL should ignore collisions
and runts and other garbage and in theory WOL should be simply ignored
if the system is already running).
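
For reference, a WOL "magic packet" is six 0xFF bytes followed by the
target interface's MAC address repeated sixteen times, anywhere in the
frame payload.  A minimal sketch of that check follows; the function
name and the example MAC are made up for illustration, and a real NIC
does this matching in hardware, not in software like this:

```python
def contains_wol_magic(payload: bytes, mac: bytes) -> bool:
    """Return True if payload contains the magic-packet pattern for mac.

    Hypothetical helper for illustration only; it mimics the pattern
    match a WOL-capable NIC performs on incoming frames.
    """
    if len(mac) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    # Sync stream (6 x 0xFF) followed by 16 copies of the MAC: 102 bytes.
    pattern = b"\xff" * 6 + mac * 16
    return pattern in payload


# Example MAC chosen arbitrarily for the demonstration.
mac = bytes.fromhex("00a0c9123456")
good = b"junk" + b"\xff" * 6 + mac * 16 + b"tail"
short = b"\xff" * 6 + mac * 15          # only 15 repetitions
print(contains_wol_magic(good, mac))    # True
print(contains_wol_magic(short, mac))   # False
```

Since the pattern is 102 specific bytes long, random line noise matching
it by accident would be extremely unlikely, which is why this is only a
fishing line.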

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>