port-i386: Re: Bug in x86 ioapic interrupt code for devices with shared interrupts?

Subject: Re: Bug in x86 ioapic interrupt code for devices with shared interrupts?
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Jonathan Stone <jonathan@Pescadero.dsg.stanford.edu>
List: port-i386
Date: 03/03/2006 11:37:31

In message <du58ct$8ha$3@serpens.de>Michael van Elst writes
>tls@rek.tjls.com (Thor Lancelot Simon) writes:
>
>>One of the NetBSD Foundation servers is a dual-Opteron with an onboard
>>dual Broadcom gigabit chip.  This chip, as far as I can tell from some
>>experiments, gets seriously disturbed by the driver's tendency to
>>acknowledge interrupts even if they're not actually from the device.
>
>I believe something similar hits a Thinkpad T43 which is i386 and
>not amd64. Almost all peripherals take irq 11 and using the bge
>interface causes symptoms I correlate with bad interrupt handling.

Yes, the interrupt handler in bge(4) (sys/dev/pci/if_bge.c:bge_intr())
is known to give an inaccurate return code.  That problem can cause
interrupts to not be forwarded to other devices sharing the same IRQ.
This is a long-known bug in bge(4). However, every time I've tried
to turn on the
	 #ifdef notdef

code in bge_intr(), the resulting kernel hung.  If I had a
programmer's manual, I'd go looking for ways to ascertain if the bge
really interrupted.

I believe the canonical remedy used for these devices in other OSes is
to use MSI interrupts as an implicit confirmation that the bge
actually interrupted. But since NetBSD doesn't support MSI,
that's not currently an option for NetBSD.

All that aside:

If I've understood Thor correctly, he's seeing the converse problem:
his machine shares an IRQ between the RAID controller and one function
of the dual-port bge (almost certainly a bcm5704), and that sharing is
causing the *bge* to lock up.

In your case, calling bge_intr (if the bge is ifconfig'ed up) swallows
interrupts which are actually from devices later in the chain of
devices sharing that interrupt, causing those devices to appear
unresponsive.  Whereas in Thor's case, it's the bge itself which
becomes unresponsive.
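
To make the "swallowing" failure mode concrete, here is a minimal toy
model in C. It is not NetBSD's interrupt code and not the real
bge_intr(): every name in it (sim_dev, sim_handler, honest_intr,
greedy_intr, dispatch_first_claim) is hypothetical, and it assumes a
dispatcher that stops walking the shared chain as soon as one handler
claims the interrupt.

```c
#include <stddef.h>

/* Hypothetical device state for the toy model. */
struct sim_dev {
    int pending;   /* nonzero if this device really raised the interrupt */
    int serviced;  /* how many times its handler did real work */
};

/* Hypothetical entry in the chain of handlers sharing one IRQ. */
struct sim_handler {
    int (*fn)(void *arg);  /* returns nonzero to claim the interrupt */
    void *arg;
};

/* Accurate handler: claims the interrupt only if its device raised it. */
static int honest_intr(void *arg)
{
    struct sim_dev *d = arg;

    if (!d->pending)
        return 0;          /* not ours: let the next handler look */
    d->pending = 0;
    d->serviced++;
    return 1;
}

/* Inaccurate handler: claims the interrupt whether or not it was ours. */
static int greedy_intr(void *arg)
{
    struct sim_dev *d = arg;

    if (d->pending) {
        d->pending = 0;
        d->serviced++;
    }
    return 1;              /* wrong when the device was not pending */
}

/*
 * Assumed dispatch policy: walk the chain in order and stop at the
 * first handler that claims.  Returns the index of the claiming
 * handler, or -1 if nobody claimed (a stray interrupt).
 */
static int dispatch_first_claim(struct sim_handler *chain, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (chain[i].fn(chain[i].arg))
            return (int)i;
    }
    return -1;
}
```

With the greedy handler first in the chain and only the second device
pending, the dispatcher stops at index 0 and the second device is never
serviced, so it looks unresponsive. Swapping in the honest handler lets
the same interrupt reach the device that raised it.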

The only way I can imagine for these symptoms to have the same cause
is if Thor's machine (a TNF machine which Thor administers) has all
three devices -- RAID, bge0, bge1 -- sharing the same IRQ, and
it's the bge1 port which goes catatonic.  But that doesn't seem
to match Thor's description.

Also, I have several machines locally with multiple (e.g., four) bge
devices, where pairs of bges end up sharing interrupts. Strangely
enough, *that* seems to work fine, with all four bges happily
swallowing a gigabit of traffic simultaneously.

Very confused, and wondering if I've misunderstood one or both
descriptions,

--Jonathan