Current-Users archive


Re: Acer M5229 IDE bugs (esp. on sparc64)



[...sorry for the necro-posting, just happened to get back to this last night...]

Manuel Bouyer wrote:
> On Thu, Feb 14, 2008 at 02:17:47PM -0500, Rafal Boni wrote:
>> [...]
>> I see that wdc->dma_status is always 0x04 (WDC_DMAST_UNDER), which is a synthetic error generated only by pciide_dma_finish(). I'm guessing that the suspect pciide_dma_finish() is the one called from wdcintr(). Because the rev of the M1559 IDE controller I have doesn't have a chan-id register to determine which channel caused an interrupt, for this chip we end up *always* checking both channels, and the code in wdcintr() / pciide_dma_finish() looks very suspicious... stop DMA first, ask questions later.
>
> AFAIK it's not: before stopping the DMA channel, we check if the controller
> did interrupt (status & IDEDMA_CTL_INTR).
> We have to stop the DMA first, because in some cases the DMA engine will still
> be active at the end of the transfer (if the device has less data to return
> than requested, for example; as the comment says, it's a valid condition for
> ATAPI devices).

OK, in my case the workload shouldn't involve any ATAPI traffic to speak of (one disk on each channel, both part of a RAID-1, plus an unused CD-ROM on one of the channels). If I add a bit of debug output to pciide_dma_finish() like the below:

@@ -768,6 +769,12 @@ pciide_dma_finish(v, channel, drive, for
        }

        if ((status & IDEDMA_CTL_ACT) != 0 && force != WDC_DMAEND_ABRT_QUIET) {
+               if (force == WDC_DMAEND_END) {
+                       aprint_error("%s:%d:%d: stopping still-busy xfer, "
+                          "status=0x%x\n",
+                          device_xname(sc->sc_wdcdev.sc_atac.atac_dev),
+                          channel, drive, status);
+               }

I see pretty frequent messages like:

aceride0:1:0: stopping still-busy xfer, status=0x65
or
aceride0:0:0: stopping still-busy xfer, status=0x25

during the RAID parity rebuild after a dirty reboot (in this case I forced a dirty reboot just to test). Note that channel 1 is the one with the CD-ROM. The messages also show up at other times; the parity rebuild is just the easiest way to guarantee that I'll get them.

Another interesting thing is that a hack I took from OpenBSD, which stops pciide_pci_intr() from skipping channels that don't have the WDCF_IRQ_WAIT flag set [1], seemed to make the controller behave better -- far fewer DMA errors with status WDC_DMAST_UNDER, and in fact the interface downgraded to Ultra/33 (from Ultra/66 originally) and then produced no further errors.

That last bit does make me wonder if this is really a confluence of two things -- some generic interrupt / DMA handling error, along with either a setup bug for Ultra/66 mode or the inability of the chip to handle Ultra/66 transfers on both channels. However, as I said before, the FreeBSD fix for ATA66 byte-count-something-or-other didn't help here.

> If this chip asserts IDEDMA_CTL_INTR spuriously, and we need to check
> IDEDMA_CTL_ACT instead, then it's broken. So it needs a private intr
> routine, and it needs to disable DMA for ATAPI devices.

From the above debug code I added, it does look like the interrupt is getting there early, so maybe this needs to be done. However, I hate adding code like that without having any other platform with this god-forsaken chip in it to test on.

--rafal

[1] http://www.openbsd.org/cgi-bin/cvsweb/src/sys/dev/pci/pciide.c.diff?r1=1.266&r2=1.267&f=h
