Subject: Re: MP AS4100 & CS20 hanging
To: Jeff Workman <jworkman@pimpworks.org>
From: Stephen M. Jones <smj@cirr.com>
List: port-alpha
Date: 01/03/2003 12:09:01
Jeff asked about my CS20 configurations ..

I'm only using one and it's actually quite barebones (CS20 hardware support
w/ NFS, quotas, no raidframe).

When 1.6 was released I immediately got two of them off of 1.5x that I was
testing with and did a fresh install of 1.6 MP .. I tested them out on the
bench and they seemed to be working great .. I installed them on site, 
they came up, NFS mounted filesystems off of the AS1200s and got to work.
About four or five hours later everything 'locked up'.  Nothing on the
network could be accessed.  What had happened is that, somehow, when one
of the machines hung it would assert its port on both etherswitches.  Each
NIC goes to a separate switch... one switch is for public data, the other
is for private (NFS and such).  When I got on site the LEDs on the switches
and NIC ports on the CS20 were strobing.  I enabled consoles on them and was
able to see it occur in action .. no messages on the consoles.  I took the
machines home and ran them for hours doing semi-normal, but heavy I/O transfers
between them .. I was eventually able to get one to hang .. 

I decided to get off the MP code (when I ran 1.5Z on the spare AS1200 I
noticed a similar 'lock up' behaviour which required going to RCM) .. when
I booted single processor I could not get the machines to hang as they did.
I could get them to panic and drop to a debugger, but not hard hang .. so,
I installed one .. and it's been in place for nearly three months .. I'm going
to install the second one soon .. but, I'm really worried about how they 
behaved during a crash with an MP kernel.

Again, these are very basic CS20 generic kernel configurations .. no special
magic.