tech-kern archive


SunFire v100 / Acer M5229 IDE DMA error workaround



Folks:
        I've been taunted by the IDE interface on my SunFire V100 for a long
        (LOOONG!) time with messages along the lines of:

        wdNN: DMA error writing fsbn xxxx of xxxx-yyy (wdN bn pppp; cn ccc tn tt sn ss), retrying
        wdN: soft error (corrected)

        This box is running RAIDFrame over 2 110GB IDE drives, one on each
        channel.  While the errors have not caused any data loss, they do
        eventually cause the IDE subsystem to downgrade to slower-and-slower
        DMA modes and eventually even to PIO access to the disks.  See [1]
        and the messages in that thread for my prior attempts at getting
        rid of these errors.

        I've since tried a bunch more stuff, and found that none of the
        aceride-specific changes made any real difference.  It looks like
        for whatever reason, the chip asserts interrupts before the DMA
        is complete, or the PCI IDE code at least believes that is the
        case.  So last night, after looking at the FreeBSD and OpenBSD
        IDE code, I came up with the following set of changes, which 
        so far has not had any negative consequences on the system and
        has survived a complete RAID parity rebuild (this was the one
        case where I *always* got the DMA errors) without spewing a
        single IDE DMA-related error.  It also makes the box feel a
        bit snappier, but maybe I'm just imagining that ;)
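
To make the core of the change concrete, here is a minimal standalone sketch of the old and new interrupt-acceptance tests from pciide_dma_finish.  The function names are hypothetical, the bit values are assumed to follow the standard bus-master IDE status register layout, and the real code also depends on the `force` argument, which is left out here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bus-master IDE status bits (standard register layout). */
#define IDEDMA_CTL_ACT  0x01  /* DMA engine still active */
#define IDEDMA_CTL_ERR  0x02  /* bus-master error */
#define IDEDMA_CTL_INTR 0x04  /* interrupt asserted */

/* Old test: any status with INTR set is accepted as a completion. */
bool
irq_accepted_old(uint8_t status)
{
    return (status & IDEDMA_CTL_INTR) != 0;
}

/* Patched test: INTR must be set and ACT must be clear. */
bool
irq_accepted_new(uint8_t status)
{
    return (status & (IDEDMA_CTL_INTR | IDEDMA_CTL_ACT)) == IDEDMA_CTL_INTR;
}

/* A few sanity checks on the two predicates. */
void
irq_check_examples(void)
{
    uint8_t early = IDEDMA_CTL_INTR | IDEDMA_CTL_ACT; /* IRQ while busy */
    uint8_t done = IDEDMA_CTL_INTR;                   /* IRQ, engine idle */

    assert(irq_accepted_old(early));   /* old code finishes too soon */
    assert(!irq_accepted_new(early));  /* new code reports "not ours yet" */
    assert(irq_accepted_old(done) && irq_accepted_new(done));
}
```

The only case where the two tests disagree is INTR and ACT set together, which is the "interrupt before the DMA is complete" situation described above; in the patch that case now returns WDC_DMAST_NOIRQ instead of stopping the transfer.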

        I realize this change is probably done in the wrong place -- I
        should probably have created an aceride-specific dma_finish method
        and done the checks there, but this is at least a proof-of-concept
        that the change works; the change also includes some more debug
        logging in the case of DMA errors, which isn't necessary to fix
        the issue but helped me diagnose it, so I've left it in for now.
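
For illustration only, the chip-specific alternative mentioned above might look roughly like the following standalone mock.  Nothing here is the real pciide or aceride interface: `struct chan`, `generic_dma_finish`, `acer_dma_finish` and `DMAST_NOIRQ` are hypothetical stand-ins, and the `force` argument is again omitted.  The idea is just that the Acer hook rejects a still-active engine and otherwise defers to the common code:

```c
#include <assert.h>
#include <stdint.h>

#define IDEDMA_CTL_ACT  0x01  /* assumed: DMA engine still active */
#define IDEDMA_CTL_INTR 0x04  /* assumed: interrupt asserted */
#define DMAST_NOIRQ     0x01  /* hypothetical stand-in for WDC_DMAST_NOIRQ */

/* Hypothetical per-channel state with a replaceable finish hook. */
struct chan {
    uint8_t status;                    /* last-read DMA status byte */
    int (*dma_finish)(struct chan *);  /* per-chip hook */
};

/* Common behaviour: only INTR is checked. */
int
generic_dma_finish(struct chan *c)
{
    if ((c->status & IDEDMA_CTL_INTR) == 0)
        return DMAST_NOIRQ;
    return 0;
}

/* Chip-specific hook: also refuse to finish while ACT is still set. */
int
acer_dma_finish(struct chan *c)
{
    if ((c->status & IDEDMA_CTL_ACT) != 0)
        return DMAST_NOIRQ;
    return generic_dma_finish(c);
}

/* An early interrupt is deferred; a clean completion is accepted. */
void
demo_early_irq(void)
{
    struct chan c = { IDEDMA_CTL_INTR | IDEDMA_CTL_ACT, acer_dma_finish };

    assert(c.dma_finish(&c) == DMAST_NOIRQ);
    c.status = IDEDMA_CTL_INTR;
    assert(c.dma_finish(&c) == 0);
}
```

Keeping the extra check in a per-chip hook would leave the common path, and therefore ATAPI on other controllers, untouched.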

        Finally, I know Manuel mentioned that doing something along these
        lines would likely have an impact on ATAPI DMA operations, and I
        have not tested it with anything beyond ATA disks -- however, I'm
        not sure that ATAPI DMA ever worked on my V100 -- I think it always
        falls back to PIO, at least with the CDROM in the system.

Patch below... I'd love comments / feedback, esp. on ATAPI use cases,
--rafal

---8<------8<------8<------8<------8<------8<------8<------8<------8<---
Index: ata/ata_wdc.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ata/ata_wdc.c,v
retrieving revision 1.90
diff -u -p -r1.90 ata_wdc.c
--- ata/ata_wdc.c       2 Oct 2008 21:05:17 -0000       1.90
+++ ata/ata_wdc.c       29 Oct 2008 15:30:20 -0000
@@ -681,6 +681,10 @@ wdc_ata_bio_intr(struct ata_channel *chp
                }
                if (wdc->dma_status != 0) {
                        if (drv_err != WDC_ATA_ERR) {
+                               printf("%s:%d:%d: DMA error (st=0x%x, er=0x%x)\n",
+                                   device_xname(atac->atac_dev),
+                                   chp->ch_channel, xfer->c_drive, 
+                                   wdc->dma_status, ata_bio->r_error);
                                ata_bio->error = ERR_DMA;
                                drv_err = WDC_ATA_ERR;
                        }
Index: pci/pciide_common.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/pciide_common.c,v
retrieving revision 1.38
diff -u -p -r1.38 pciide_common.c
--- pci/pciide_common.c 18 Mar 2008 20:46:37 -0000      1.38
+++ pci/pciide_common.c 29 Oct 2008 15:30:21 -0000
@@ -737,7 +738,9 @@ pciide_dma_finish(v, channel, drive, for
        ATADEBUG_PRINT(("pciide_dma_finish: status 0x%x\n", status),
            DEBUG_XFERS);
 
-       if (force == WDC_DMAEND_END && (status & IDEDMA_CTL_INTR) == 0)
+       /* XXXrkb: From FreeBSD; should probably add an evcnt here */
+       if (force == WDC_DMAEND_END && 
+           ((status & (IDEDMA_CTL_INTR | IDEDMA_CTL_ACT)) != IDEDMA_CTL_INTR))
                return WDC_DMAST_NOIRQ;
 
        /* stop DMA channel */
@@ -752,6 +755,9 @@ pciide_dma_finish(v, channel, drive, for
            BUS_DMASYNC_POSTREAD : BUS_DMASYNC_POSTWRITE);
        bus_dmamap_unload(sc->sc_dmat, dma_maps->dmamap_xfer);
 
+       /* Clear status bits */
+       bus_space_write_1(sc->sc_dma_iot, cp->dma_iohs[IDEDMA_CTL], 0, status);
+
        if ((status & IDEDMA_CTL_ERR) != 0 && force != WDC_DMAEND_ABRT_QUIET) {
                aprint_error("%s:%d:%d: bus-master DMA error: status=0x%x\n",
                    device_xname(sc->sc_wdcdev.sc_atac.atac_dev), channel,
@@ -768,6 +774,12 @@ pciide_dma_finish(v, channel, drive, for
        }
 
        if ((status & IDEDMA_CTL_ACT) != 0 && force != WDC_DMAEND_ABRT_QUIET) {
+               if (force == WDC_DMAEND_END) {
+                       aprint_error("%s:%d:%d: stopping still-busy xfer, "
+                           "status=0x%x\n", 
+                           device_xname(sc->sc_wdcdev.sc_atac.atac_dev),
+                           channel, drive, status);
+               }
                /* data underrun, may be a valid condition for ATAPI */
                error |= WDC_DMAST_UNDER;
        }
---8<------8<------8<------8<------8<------8<------8<------8<------8<---

[1] http://mail-index.netbsd.org/port-sparc64/2008/02/14/msg000101.html
-- 
  Time is an illusion; lunchtime, doubly so.     |/\/\|           Rafal Boni
                   -- Ford Prefect               |\/\/|      rafal%pobox.com@localhost

