tech-kern archive


Re: Interrupt storm mitigation needed



Thor Lancelot Simon <tls%panix.com@localhost> writes:

> We saw this on a platform of similar vintage at a former employer of
> mine, and indeed the uhci was one of the devices involved.

While googling just now, I discovered that you've observed the problem
before within the NetBSD Foundation, as well:

http://mail-index.netbsd.org/port-amd64/2006/03/01/0004.html

What seems to happen is that the Intel E7520 chip set has a bug: while
an interrupt is being handled, and its ioapic pin is temporarily masked,
the chip set somehow decides to make the masked interrupt pop up
somewhere else.

I've seen others describe the exact symptoms you did, so I'm assuming
that you had the same Intel chip set I do, but in a different machine,
where it was wired up differently.  Thus, you got leakage from the amr
to the bge, whereas I (and others with Dell products) get a different
pattern:

I've been observing my system closely, with uhci in polled mode, using
Joerg's patch, and now also with SMP enabled.  What I've found is that
disk I/O, with the amr interrupting at ioapic1, pin 14, leaks interrupts
to ioapic0, pin 18 (uhci2), while network I/O, with the wm interrupting
at ioapic2, pin 0, leaks interrupts to ioapic0, pin 16 (uhci0).  When
the source interrupt rate is low, a low percentage of the interrupts
leak, but when the source driver is loaded down with lots of work, the
percentage increases.  When I'm running a "-j 4" system build, and, at
the same time, spooling a full backup to a scratch disk on a neighboring
system, I end up with something on the order of 10% of network
interrupts, and 30% of disk interrupts, leaking to the wrong ioapic, as
opposed to about 1% and 2%, respectively, when the system is lightly
loaded.  It feels exponential, but I haven't plotted the data.
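
For anyone wanting to reproduce percentages like these, here is a
minimal sketch of the arithmetic, assuming two snapshots of per-pin
interrupt counters (for instance as reported by "vmstat -i") taken some
seconds apart.  The function name and the sample counts are made up for
illustration; they are not measurements from this system.  It also
assumes that the uhci pin sees no genuine interrupts of its own during
the interval, so every count on it is a leak.

```python
def leak_percent(source_before, source_after, leak_before, leak_after):
    """Leaked interrupts as a percentage of source interrupts.

    Each argument is a cumulative interrupt count for one pin, read at
    the start or end of the sampling interval.  Assumes every interrupt
    counted on the leak pin during the interval is a stray one.
    """
    source_delta = source_after - source_before
    leak_delta = leak_after - leak_before
    if source_delta <= 0:
        raise ValueError("source counter did not advance")
    return 100.0 * leak_delta / source_delta


# Hypothetical counts: 1000 source interrupts, 100 seen on the wrong pin.
print(leak_percent(1000, 2000, 10, 110))  # prints 10.0
```

Sampling this at several load levels and plotting leak percentage
against source interrupt rate would show whether the growth really is
exponential.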

The hangs I've seen are probably related to the feedback loop where a
busy source interrupt handler means more work for the leaked interrupt
handler, which in turn reduces the system's ability to handle interrupts
quickly, leading to yet more leaked interrupts.  I'm guessing that my
system is surviving this, and not letting it escalate into full-on
hangs, because Joerg's patch has it spending much less time on each
leaked interrupt, so load peaks don't get escalated out of control.
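
The guessed feedback loop can be illustrated with a toy model.  This is
only a sketch under invented assumptions: the leak fraction is taken to
grow linearly with CPU load, and every parameter (interrupt rate, base
load, cost per leaked interrupt, leak floor and slope) is a hypothetical
number, not something measured on the E7520.  It shows only the shape
of the argument: with a cheap leaked-interrupt handler the load settles,
with an expensive one it runs away.

```python
def simulate(cost_per_leak, base_load=0.5, rate=10_000.0,
             leak_floor=0.01, leak_slope=0.3, steps=50):
    """Iterate a toy load model; return the final load, capped at 1.0.

    cost_per_leak: seconds of CPU spent per leaked interrupt (hypothetical)
    base_load:     fraction of CPU used by everything else
    rate:          source interrupts per second
    leak_floor, leak_slope: assumed linear relation between load and
                   the fraction of source interrupts that leak
    """
    load = base_load
    for _ in range(steps):
        # Busier system -> larger share of interrupts leak (assumption).
        leak_fraction = leak_floor + leak_slope * load
        # Each leaked interrupt adds handler time on top of the base load.
        load = base_load + rate * leak_fraction * cost_per_leak
        if load >= 1.0:
            return 1.0  # saturated: the model's stand-in for a hang
    return load


cheap = simulate(cost_per_leak=5e-6)      # fast path for leaked interrupts
expensive = simulate(cost_per_leak=2e-4)  # slow path for leaked interrupts
print(cheap < 1.0, expensive == 1.0)      # prints True True
```

In this model the loop stays stable as long as rate * cost_per_leak *
leak_slope is well below 1, which is one way to read why spending less
time on each leaked interrupt would keep load peaks from escalating.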

-tih
-- 
Popularity is the hallmark of mediocrity.  --Niles Crane, "Frasier"

