port-amd64: Re: Massive interrupt problems on Tyan 2882-D

Subject: Re: Massive interrupt problems on Tyan 2882-D
To: None <port-amd64@NetBSD.org>
From: Edgar Fuß <ef@math.uni-bonn.de>
List: port-amd64
Date: 03/22/2007 17:41:48
Since the problem is probably not amd64-specific and there is no port-x86
list, I am cc'ing port-i386, although I do not read that list.

As the subject says, I've got interrupt problems on a 2882-D (one  
single-core CPU fitted, non-SMP kernel) where drivers seem suddenly  
to cease receiving interrupts.
Specifically, ahd(4) complains about timed out SCBs being already  
complete and bge(4) about blocks that do not stop.
The machine runs fine for hours (under NFS load) and then suddenly
almost locks up (every disk I/O takes minutes to complete).

The more I think about it, the weirder it gets.

We have:
-- ahd0 and bge0 sharing ioapic1 pin0, thus irq5.
-- ahd1 and bge1 sharing ioapic1 pin1, thus irq10.
Given that according to Tyan's block diagrams, all of these devices  
are on the same PCI bus (Bus A of the 8131), it looks reasonable that  
they actually share two PCI Interrupts and thus two IOAPIC pins.  
Also, if I switch to a non-ioapic kernel, dmesg keeps reporting them  
as using irq5 and irq10.
-- A 36G RAID1 on ahd0 containing the OS.
-- A 928G RAID5 on ahd1 holding user data (not really, at the moment).
-- Active traffic on bge0.
-- Nothing on bge1: the interface is down.
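
The sharing layout above can be written down as a small lookup table. A minimal Python sketch, where `ROUTING` and `devices_by_irq` are illustrative names and only the device names, pins and IRQ numbers come from the description above:

```python
from collections import defaultdict

# Device -> (ioapic pin, legacy irq), as listed above.
# The table is typed in by hand, not read from the machine.
ROUTING = {
    "ahd0": ("ioapic1 pin0", 5),
    "bge0": ("ioapic1 pin0", 5),
    "ahd1": ("ioapic1 pin1", 10),
    "bge1": ("ioapic1 pin1", 10),
}


def devices_by_irq(routing):
    """Group device names by the legacy IRQ they are routed to."""
    groups = defaultdict(list)
    for dev, (_pin, irq) in sorted(routing.items()):
        groups[irq].append(dev)
    return dict(groups)


for irq, devs in devices_by_irq(ROUTING).items():
    print(f"irq{irq}: {', '.join(devs)}")
```

Each interrupt line ends up with one SCSI controller and one network interface on it, which is why a sharing problem is the first suspect.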

The machine ran without problems, even with heavy I/O on ahd1  
(rebuilding RAID parity).

I had two situations where problems arose, both involving heavy usage  
as an NFS server. Unfortunately, that's exactly what the machine is  
supposed to be used for.

Both times, I had a (Linux) NFS client writing large amounts of data
to the RAID on ahd1. Both times, that data came in on bge0.

The first time, I got errors on ahd1 (since I didn't use bge1 at that
time, I know nothing about that driver). But I got no errors on either
bge0 or ahd0. So this looks like a problem on IRQ10.

The second time, I got errors on both ahd0 and bge0 while ahd1 worked.

If this were an interrupt sharing issue, why would I get problems in
case one? There's no activity on bge1 sharing the interrupt with ahd1.

If it were an issue in ahd(4), why doesn't it show up when building
RAID parity?

It might be some issue in bge(4), but why did that affect ahd1 while  
leaving bge0 unaffected in case one?

The problem seems to be triggered by simultaneously high network and  
SCSI traffic. But in case two, the net traffic involved irq5 while  
the SCSI traffic involved irq10.
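
The two incidents can be tabulated against the interrupt lines to make the mismatch explicit. This is only a sketch: `IRQ`, `CASES` and `summarize` are illustrative names, and the "busy" sets assume that the only heavy load was the NFS write path (bge0 in, ahd1 out), ignoring whatever light OS traffic went to ahd0.

```python
# Device -> legacy IRQ, as listed earlier in this mail.
IRQ = {"ahd0": 5, "bge0": 5, "ahd1": 10, "bge1": 10}

# Per incident: devices under heavy load, and devices that logged errors.
# Assumption: heavy load was only the NFS write path (bge0 -> ahd1).
CASES = {
    "case one": {"busy": {"bge0", "ahd1"}, "failed": {"ahd1"}},
    "case two": {"busy": {"bge0", "ahd1"}, "failed": {"ahd0", "bge0"}},
}


def summarize(case):
    """Return (IRQs with errors, loaded IRQs that showed no errors)."""
    failed_irqs = {IRQ[dev] for dev in case["failed"]}
    busy_irqs = {IRQ[dev] for dev in case["busy"]}
    return sorted(failed_irqs), sorted(busy_irqs - failed_irqs)


for name, case in CASES.items():
    failed, healthy = summarize(case)
    print(f"{name}: errors on irq {failed}, loaded but healthy irq {healthy}")
```

With the same load pattern in both cases, the failing line is irq10 in case one and irq5 in case two, so neither line can be singled out as the broken one.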

I once thought it might be some spl confusion in bge(4), but I think  
I would have had even more fun if it were. Also, this would have  
affected bge0 in case one.

Any ideas, anyone? I've got four to five identical machines to test
all sorts of things on, albeit only one storage box with really large
amounts of disk space.