Subject: kern/26568: Yesterday's "pciide" `irqack fix' breaks Promise 202xx controllers
To: None <gnats-bugs@gnats.NetBSD.org>
From: None <paul@Plectere.com>
List: netbsd-bugs
Date: 08/06/2004 02:24:43
>Number:         26568
>Category:       kern
>Synopsis:       an occasional "pdcide0:0 bogus intr (reg 0x1xxxxxxxx)" is fatal
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Aug 06 09:28:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     Paul Shupak
>Release:        NetBSD 2.0G
>Organization:
	
>Environment:
	
	
System: NetBSD svcs 2.0G NetBSD 2.0G (SVCS) #262: Fri Aug 6 01:17:44 PDT 2004 root@svcs:/sys/arch/i386/compile/SVCS i386
Architecture: i386
Machine: i386
>Description:
	
	For several years I have seen spurious interrupts (a few a day, under
heavy load) on Promise FastTrack-66s used as part of RAID arrays.  With
yesterday's changes, the "bogus" interrupt repeats indefinitely instead of
stopping after a single instance (i.e. the machine "printf"s until either
the reset or the power button is hit).  This has never been a problem on
these controllers before.  I have two, each part of a RAID array in a
different machine; the Via chipset gives ~6 times as many "bogus" interrupts
as the Intel chipset (VT82C691 vs. i810), but both give "bogus" interrupts
for their PDCs.  Neither card is "sharing" its PCI interrupt with any other
device in either machine (one has irq 5 dedicated, the other has irq 11).
	Reverting pdcide.c from version 1.12 to 1.11 solves the problem for
me (i.e. back to the occasional single "bogus" interrupt: about 15-25 a day
on the Via machine, about 2-4 a day on the i810).
>How-To-Repeat:
	
	Boot a machine with a FastTrack-66, then cause heavy disk activity
to occur (a RAID5 parity rebuild works just fine).  Watch the endless loop
of "bogus" messages.
>Fix:
	
	Revert the change for the Promises?  Or maybe add a further test on
the wdc state beyond the simple "wdcintr(wdc_cp)"?  Either way, please do not
write "IDEDMA_CTL" during the interrupt without also acknowledging the
interrupt to the hardware (i.e. the EOI dance on x86).  If a DMA is really
pending, then (remember, the WDC `cause' has now been cleared) either we get
into the infinite-loop case described above, beginning when the outstanding
DMA completes, or we lose the outstanding transaction.  Neither is a good
choice; the outstanding request causes another "bogus" interrupt, etc.
	Alternatively, look into returning non-zero and doing the EOI dance
to prevent redelivery of the same interrupt.  Note that the "bogus" case in
the Promise code returns zero; if we're eating the interrupt, we probably
should return one, i.e. "rv = 1;"?  I didn't test this, but it seems like
it might be simple enough to work.
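	To illustrate the "rv = 1" idea, here is a toy userland model.  It is
not the pdcide.c code and has not been run against real hardware; every
name in it (toy_chan, toy_wdcintr, toy_pdc_intr, toy_dispatch) is
hypothetical, and the dispatcher's habit of redelivering an interrupt that
nobody claims is an assumption made for the sake of the sketch.

```c
#include <assert.h>

/* Toy model only; none of these names exist in the NetBSD tree. */
struct toy_chan {
	int wdc_has_work;	/* nonzero if the wdc layer expects an intr */
	int bogus_seen;		/* count of "bogus intr" events */
};

/* Hypothetical stand-in for wdcintr(): claim only if work is pending. */
static int
toy_wdcintr(struct toy_chan *cp)
{
	if (!cp->wdc_has_work)
		return 0;
	cp->wdc_has_work = 0;
	return 1;
}

/* Toy handler; claim_bogus selects the proposed "rv = 1" behaviour. */
static int
toy_pdc_intr(struct toy_chan *cp, int claim_bogus)
{
	int rv = toy_wdcintr(cp);

	if (rv == 0) {
		cp->bogus_seen++;	/* where "bogus intr" would be printed */
		if (claim_bogus)
			rv = 1;		/* we ate the interrupt, so say so */
	}
	return rv;
}

/*
 * Toy dispatcher.  ASSUMPTION: an interrupt nobody claims is delivered
 * again; a claimed one gets its EOI.  Returns the number of deliveries,
 * capped at limit so the model always terminates.
 */
static int
toy_dispatch(struct toy_chan *cp, int claim_bogus, int limit)
{
	int deliveries = 0;

	assert(limit > 0);
	while (deliveries < limit) {
		deliveries++;
		if (toy_pdc_intr(cp, claim_bogus) != 0)
			break;		/* claimed: EOI, stop redelivering */
	}
	return deliveries;
}
```

	In this model, a spurious interrupt (wdc_has_work == 0) handled with
claim_bogus == 0 is redelivered until the cap is reached, while the same
interrupt handled with claim_bogus == 1 is delivered exactly once.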
>Release-Note:
>Audit-Trail:
>Unformatted: