port-alpha: implementation: NetBSD on AS1200s

Subject: implementation: NetBSD on AS1200s
To: None <port-alpha@netbsd.org>
From: Stephen M Jones <smj@cirr.com>
List: port-alpha
Date: 03/06/2002 13:15:23

Hi, I've posted some questions in the past concerning implementation of
NetBSD on AS1200 (id est DS 5305) and wanted to get a little thread going.
I want to present this in a different way to avoid any flaming.

So let me explain what I'm doing.

I've got four AS1200s: three are production, one is a development machine.
For the most part they don't require too much babysitting and do just fine.

Two of the boxes have a short point-to-point connection between two 100 Mbit
ethernet interfaces dedicated to NFS traffic.  Both are running servers
and clients, sharing some of their disks with each other (I realise that
others may consider this a bad thing to do; it could instead be implemented
as one machine hosting disks for another that only runs a client).

The options I use are 'intr,-b': intr allows an interrupt signal to be
acted on in the event of an NFS hang, and -b forks the mount off into the
background during boot so a system doesn't have to wait forever for a
server to come up.

For the machines' serial consoles, I'm using a Livingston PortMaster
terminal server, set to 9600 8n1 with no flow control.  The serial consoles
under SRM are also set with no flow control.

Things that work (I've talked about this before on the list, but here is
a summary):

I'm booting a -b flag (1.5.3_alpha)
I can break back to SRM *during* a boot by hitting ^C  
Once the system has booted (beyond the secondary bootstrap) I cannot get
back to SRM except through a halt.
I can drop to the kernel debugger by sending a break signal, but ONLY
when the system is not hung.

So now a recent problem.  One of the systems will hang (it's *always* the
same one).  I will connect to the console via the terminal server and be
at the debugger.  If I 'cont' I will typically see complaints about
tlp0 and ex0 (usually with an exliftr or similar message) and then
nothing; it's hung.  No amount of break signals or anything else will get
back to SRM or a debugger.  The machine must be hard reset.

The other machine will respond and I can log in as root, but any command
execution will fail because the process table is full.  Trying an 'exec halt'
or 'exec reboot' does effectively reboot the machine.  I can break to a
debugger by sending a break.

On both machines:

Trying a 'reboot' from db> *might* sometimes work: you'll see a message
from sync, but depending on how long it's been waiting, both machines might
just hang there and both will have to be hard reset.

The two areas I need to address are:

1. What is causing the hang?  Since I never see this problem on the other
   machine, I'm concerned with the tlp0 (DEC500) interface.  There are an
   alarming number of collisions:

Name  Mtu   Ipkts    Ierrs  Opkts    Oerrs  Colls
tlp0  1500  4363083      0  4404094      0  311902
tlp0  1500  4363083      0  4404094      0  311902
tlp0  1500  4363083      0  4404094      0  311902
ex0   1500  6853858      0  7397736      0       0
ex0   1500  6853858      0  7397736      0       0

No other interface on any of the other 5 systems has collisions.  Also,
the cable checks out just fine.
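
To put the collision count in proportion, here is a minimal Python sketch
that computes collisions per output packet from the counters quoted above.
The function name and the seven-column row layout are assumptions made for
illustration; real netstat output has more columns than were pasted here.

```python
# Counters copied from the netstat output above (one row per interface).
NETSTAT_ROWS = """\
tlp0  1500  4363083  0  4404094  0  311902
ex0   1500  6853858  0  7397736  0  0
"""


def collision_rates(text):
    """Return {interface: collisions per output packet} for each row."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        name, _mtu, _ipkts, _ierrs, opkts, _oerrs, colls = fields
        opkts, colls = int(opkts), int(colls)
        # Guard against an interface that has not sent anything yet.
        rates[name] = colls / opkts if opkts else 0.0
    return rates


if __name__ == "__main__":
    for name, rate in collision_rates(NETSTAT_ROWS).items():
        print(f"{name}: {rate:.2%} collisions per output packet")
```

For these counters that works out to roughly 7 collisions per 100 packets
sent on tlp0, and none on ex0.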

Some messages that show up in dmesg are similar to "stray kn300 irq 44"
which maps back to an ethernet interface (though typically the ex0 and
not the tlp0!)
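
A small sketch for tallying these messages by IRQ number, so it is easier
to see which interrupt line is the noisy one.  The message shape is assumed
from the single example quoted above, and the sample text is made up purely
to exercise the function; it is not real dmesg output.

```python
import re
from collections import Counter

# Assumed message shape, based on the "stray kn300 irq 44" example above.
STRAY_IRQ = re.compile(r"stray kn300 irq (\d+)")


def count_stray_irqs(dmesg_text):
    """Count stray-interrupt messages per IRQ number."""
    return Counter(int(m.group(1)) for m in STRAY_IRQ.finditer(dmesg_text))


if __name__ == "__main__":
    # Illustrative input only: the quoted message repeated three times.
    sample = "stray kn300 irq 44\n" * 3
    for irq, n in count_stray_irqs(sample).most_common():
        print(f"irq {irq}: {n} stray interrupt(s)")
```

In practice the text would come from saved dmesg output rather than a
hard-coded sample.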

2. Breaking to SRM when the UNIX system is hung.  This would be great.
   Obviously when NetBSD is hung you aren't going to be able to get to a
   debugger.  In my field implementation, I'm looking for pointers on how
   I could set it up (if it is possible) to go from a running or hung system
   and drop directly to SRM without having to press the two buttons in front
   of the machine (I don't live with them, and nor do any other humans :)

A hack solution would be to rig up some sort of X10 system with the terminal
server to handle power, however I want to work with what I've got first.

Also as a little disclaimer, since these are production systems with a userbase
exceeding 25000 I can't really run a debuggable kernel.  On the development
box (and the third AS1200) I've run the same kernel version, but because 
they do not experience the same load I never see the same problem.

Could it be the DEC500?  I don't have any more evidence than I've presented,
and unfortunately I never get any kernel messages other than the 'stray irq'
message, which again typically doesn't map back to that device (tlp0, the
primary interface); ex0 is the secondary.

Also, just for the record, I used to use the de driver for the DEC500.  I
ran it as the primary interface handling all traffic (including NFS) and it
would panic regularly.  I switched to the tlp driver and at the same time
added an ex device specifically for NFS traffic.

Concerning the breaking to SRM, I would rather hear from folks with
experience and not folks speculating.  Thanks :)

smj