netbsd-bugs: Re: kern/15841: WDC/ATA PCMCIA and CardBus flash disk unhandled interrupt locks kernel

Subject: Re: kern/15841: WDC/ATA PCMCIA and CardBus flash disk unhandled interrupt locks kernel
To: None <tad@entrisphere.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-bugs
Date: 03/09/2002 21:36:30
On Fri, Mar 08, 2002 at 04:32:13PM -0800, tad@entrisphere.com wrote:
> I have been using IBM microdrives in both PCMCIA (on our mpc8260
> port) and CardBus slots (on our i386 development machines) without
> any trouble for months and months.  They work wonderfully.
> 
> Now, we've begun switching to compact flash flash ATA disks (instead
> of compact flash spinning ATA disks), because we don't want any
> moving parts.
> 
> I've tried a couple of different PC Card and Compact flash flash
> [sic] disks (listed below).
> 
> The trouble I'm having is that I'm receiving a disk interrupt that
> the wdc driver is not expecting (WDCF_IRQ_WAIT is not set), which
> results in the interrupt never being cleared, thus getting stuck
> in an infinite interrupt loop.
> 
> I have verified this infinite interrupt loop is without any doubt
> the problem that is occurring on the mpc8260 board.  And I am
> assuming this is why the NetBSD-i386 host hangs as well (though I
> haven't instrumented it to check, the behavior is identical).
> 
> The hang is very timing related.  It tends to occur much more
> frequently during heavy interrupt activity.  It doesn't seem to be
> strongly correlated to disk activity alone.
> 
> On our mpc8260 device, we have the PCMCIA controller, and two 8
> port UARTs interrupting on the same IRQ.  It seems to take UART
> activity coupled with disk activity to cause the interrupt problem.
> The disk alone seems to run fine.
> 
> As soon as I start heavy uart activity, I tend to see several log
> messages per second (see the log() call in the fix below).
> Acknowledging the interrupt from the ATA drive is enough to make
> the interrupt go away, and elicits no complaints from the ATA or
> WDC drivers.
> 
> The NetBSD-i386 box tends to hang shortly after starting a "dd
> if=/dev/r${disk}${rawpart} of=/dev/null bs=512".  However, I don't
> know much about how it is configured or what else may be sharing
> the disk interrupt.
> 
> Obviously, the above hack is quite a hack, but I have spent several
> days tracking this problem down and am at a loss for how to proceed.
> 
> I tried updating the WDC and ATA drivers to NetBSD-current (as of
> last night), but the problem did not go away.
> 
> Has anyone seen anything like this before, or have some suggestions on
> what I could do to narrow the problem down?

Well, the problem is that the ATA interface was never designed for shared
interrupts: reading the status register clears the pending interrupt,
which opens the door to a race condition. The ATA device is supposed to
update the status register before posting the interrupt.
For pciide devices this is solved by using status registers in the pciide
device itself (which is vendor-specific, unfortunately). But for PCMCIA/CardBus
devices I can't see how we can solve this, unless there are vendor-specific
registers, designed to work in a shared-interrupt context, where we could
check for a pending interrupt.
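
To make the failure mode concrete, here is a minimal user-space model of it,
not NetBSD code. Every name in it (ata_model, drv_model, ata_read_status,
drv_intr, dispatch_until_quiet) is hypothetical, and it assumes a
level-triggered line that keeps firing until the device is acknowledged.
The only property taken from the discussion above is that reading the status
register is what clears the pending interrupt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an ATA device; not the real wdc/ATA structures. */
struct ata_model {
	bool irq_pending;	/* device is asserting the interrupt line */
	unsigned char status;	/* contents of the status register */
};

/* Hypothetical driver state; irq_wait stands in for WDCF_IRQ_WAIT. */
struct drv_model {
	struct ata_model *dev;
	bool irq_wait;
};

/* Reading the status register has a side effect: it deasserts the IRQ. */
static unsigned char
ata_read_status(struct ata_model *d)
{
	d->irq_pending = false;
	return d->status;
}

/* Handler that only touches the device when it expects an interrupt. */
static int
drv_intr(struct drv_model *sc)
{
	if (!sc->irq_wait)
		return 0;	/* not ours, as far as the driver can tell */
	(void)ata_read_status(sc->dev);
	sc->irq_wait = false;
	return 1;		/* claimed and acknowledged */
}

/* Re-dispatch while the line stays asserted; return the dispatch count. */
static int
dispatch_until_quiet(struct drv_model *sc, int limit)
{
	int n = 0;

	assert(limit > 0);
	while (sc->dev->irq_pending && n < limit) {
		(void)drv_intr(sc);
		n++;
	}
	return n;
}
```

With irq_wait set, one dispatch clears the line. With irq_wait clear and
irq_pending set, the handler declines every time, nothing ever reads the
status register, and the loop only stops because of the artificial limit.
That is the hang described in the report, under the assumptions stated above.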

> ------------------------------------------------------------------------------
> 	intrhandler()
> 	{
> 		handled = 0;
> 		for(ih = handlers_for_this_irq; ih != nil; ih = ih->next) {
> 			handled += ih->handle(ih->arg);
> 			if(handled)
> 				break;
> 		}
> 		if(!handled && irq == WDC_DISK_IRQ) {
> 			ih = find_ata_disk_ih();
> 			wdcfix(ih->arg);
> 		}
> 	}
> 
> ------------------------------------------------------------------------------
> 
> 	void
> 	wdcfix(void *arg)
> 	{
> 		struct channel_softc *chp;
> 
> 		chp = arg;
> 
> 		if(chp->ch_flags & WDCF_IRQ_WAIT)
> 			return;
> 
> 		wdcwait(chp, 0, 0, 0);
> 		log(LOG_ERR, "%s: read status %02x\n",
> 			chp->wdc->sc_dev.dv_xname, chp->ch_status);
> 	}

This is not right. I think this can all be implemented in wdc_intr().
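
One possible reading of that suggestion, as a sketch only: fold the status
read from wdcfix() into the interrupt handler itself, so no special case is
needed in the generic dispatcher. This is a user-space model, not the actual
wdc_intr(); chan_model, MODEL_IRQ_WAIT, model_read_status and wdc_intr_model
are hypothetical names, and returning 0 for the unexpected case is an
assumption, chosen so other handlers on the shared line still get called.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_IRQ_WAIT	0x01	/* stands in for WDCF_IRQ_WAIT */

/* Hypothetical channel state; not struct channel_softc. */
struct chan_model {
	unsigned flags;
	uint8_t status_reg;	/* what the device would return */
	bool irq_asserted;	/* device is driving the interrupt line */
	uint8_t last_status;	/* last value the driver read */
	unsigned stray_count;	/* interrupts seen while not waiting */
};

/* Reading status acknowledges the device's interrupt as a side effect. */
static uint8_t
model_read_status(struct chan_model *chp)
{
	chp->irq_asserted = false;
	return chp->status_reg;
}

static int
wdc_intr_model(struct chan_model *chp)
{
	if ((chp->flags & MODEL_IRQ_WAIT) == 0) {
		/*
		 * Not waiting for anything: read status only to clear a
		 * possible stray interrupt, count it, and do not claim it.
		 */
		chp->last_status = model_read_status(chp);
		chp->stray_count++;
		return 0;
	}
	/* Expected interrupt: acknowledge and complete the transfer. */
	chp->last_status = model_read_status(chp);
	chp->flags &= ~MODEL_IRQ_WAIT;
	return 1;
}
```

Note that this does not remove the race described earlier: on a shared line
the handler cannot tell whether the device had actually posted an interrupt
before the status read cleared it.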

-- 
Manuel Bouyer <bouyer@antioche.eu.org>