NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/41579: siista driver interrupt and command completion handling broken



>Number:         41579
>Category:       kern
>Synopsis:       siista driver interrupt and command completion handling broken
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jun 12 10:45:00 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 5.0
>Organization:
Dr. Nagler & Company GmbH
        
>Environment:
        
        
System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009 
wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
        Our systems with io-cards that uses the siisata driver has locked up 
after a short time all the time.

        I've analysed the problem and located two errors in the driver.

        1. if a command completes very fast after it has been started while the 
interrupt handling is still running for that port, the driver
           clears any pending interrupt on interrupt routine exit for the port. 
This may clear the next pending interrupt, if it is already present.
        2. if a bio-command runs into timeout, the completion routine does not 
check for the timeout condition and the failed transfer never gets
           to the ata-routines. (the code in the atapi-completion routine is 
broken too.)

        The cause for the first problem is a general error in the interrupt 
code. You may never get some interrupt conditions, process them and
        clear it (or all interrupts as done by the driver) later.
        The driver must clear the interrupt conditions that it is gooing to 
process immidiatly after determining the set of conditions and before it
        starts processing of the conditions. Otherwise there is a risk of 
loosing some conditions.
        The patch below will correct this behavior. The automatic clear of the 
completion interrupt by reading the slot-status is disabled
        for security reasons by the patch. It makes no longer sence to use it 
anymore and makes debugging of the driver very hard ...

        The cause for the second problem is just a missing check in the routine 
for the timeout condition.
        The patch below will correct this in siisata_bio_complete().
        The siisata_atapi_complete() routine has been corrected too. But the 
patch will not change the processing order (timeout versus interrupt),
        because I'm not shure if this is just another error in the driver or if 
there is a special reason why the interrupts for atapi devices are
        not cleared if a command has timed out. Someone else should have a look 
at this, because I think the current version of the atapi completion
        routine is incorrect! At least it is inconsistent to the command and 
bio completion routines.

        While debugging the driver I've seen another programming error, that 
will not influence the operation as long as there is only one CPU
        nodifiying the status register at the time. Ths Stealvine chip has a 
Bit-set (PCS) and Bit-clear (PCC) semantic for the port configuration register.
        So it makes no sence - or may lead to unexpected results if another CPU 
is just clearing a bit - to read the old status, add some bits and
        then write the result into the bit-Set-register.
        The way that should be used is just to write the bits that should be 
set now to the PCS register (or PCC when clearing a bit).
        There is an own definition for the read-mode of the port status "PS"in 
the headerfile. That should be used in all read operations and not
        the PCS definition.
        The patch below will fix this too.

        With this patch, the drivers runs stable on our systems without locking 
up anymore.

        remark:
        We have only ata devices connected - so the atapi code does not run and 
we have never seen any real error of the drives up to now.
        The chip-manual accedently says nothing about the correct order reading 
the cause of the error and clearing the error-interrupt.
        And this patch will change the order from "read error then clear 
interrupt" to "clear interrupt prior read error status".
        As long as only one slot will ever used by the driver - this means only 
one command is on the chip - this should be no problem at all.
        Due to the fact that the chip stops processing after the first error 
and all pending commands need to be reissued again, the order should
        be not relevant. But I can be wrong here ..
        If desired (or required) it would be easy to delay clearing the 
interrupt condition until the error status has been read.
        (attention: the port-slot-status may not be read prior clearing the 
completion interrupt! Otherwise there will be the next whole in
        interrupt processing and some interrupts may be lost ... So the 
command-complete interrupt must be cleared prior processing!)
>How-To-Repeat:
        Setup a system with e.g. an SIL3124 controller and start IO in it. 
After some time you will loose the contact to a disk due to a missed
        interrupt and the broken timeout handling in siisata_bio_complete().
>Fix:
        The following patch to /usr/src/sys/dev/ic/siisata.c will fix the 
problems above. (output of "diff -c")

*** siisata.c-orig      Wed Jun 10 15:50:37 2009
--- siisata.c   Fri Jun 12 10:20:19 2009
***************
*** 233,238 ****
--- 233,240 ----
        /* come out of reset, 64-bit activation */
        PRWRITE(sc, PRX(chp->ch_channel, PRO_PCC),
            PR_PC_32BA | PR_PC_PORT_RESET);
+       /* disable clear interrupt on PSS read */
+       PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_INCOR);
        /* initialize port */
        siisata_reinit_port(chp);
        /* clear any interrupts */
***************
*** 393,398 ****
--- 395,410 ----
                for (i = 0; i < sc->sc_atac.atac_nchannels; i++)
                        if (is & GR_GIS_PXIS(i))
                                siisata_intr_port(sc, &sc->sc_channels[i]);
+ #ifdef DIAGNOSTIC
+               is &= ~GR_GIS_PXIS_MASK;
+               if (is) { /* reset any unexpected interrupts to avoid busy 
loops ... */
+                       log(LOG_WARNING, "%s: resetting unexpected interrupt 
(GIS 0x%x)\n", SIISATANAME(sc), is);
+                       GRWRITE(sc, GR_GIS, is);
+               }
+ #else
+               if (is & ~GR_GIS_PXIS_MASK)
+                       panic("%s: %s: unexpected interrupt (GIS 0x%x)", 
SIISATANAME(sc), __func__, is);
+ #endif
        }
        return r;
  }
***************
*** 407,421 ****
        SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
            SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);
  
        if ((xfer != NULL) && (xfer->c_intr != NULL))
                xfer->c_intr(chp, xfer, slot);
  #ifdef DIAGNOSTIC
-       else
                log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
  #endif
! 
!       /* clear some (ok, all) ints */
!       PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);
  
        return;
  }
--- 419,434 ----
        SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
            SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);
  
+ /* remark: the called routine must clear all processed interrupt itself - 
otherwise we may loose some interrupts ... */
        if ((xfer != NULL) && (xfer->c_intr != NULL))
                xfer->c_intr(chp, xfer, slot);
+       else {
  #ifdef DIAGNOSTIC
                log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
  #endif
!               /* clear some (ok, all) ints */
!               PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);
!       }
  
        return;
  }
***************
*** 430,436 ****
        int slot = SIISATA_NON_NCQ_SLOT;
  
        /* wait for ready */
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
                DELAY(10);
  
        prb = schp->sch_prb[slot];
--- 443,449 ----
        int slot = SIISATA_NON_NCQ_SLOT;
  
        /* wait for ready */
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
                DELAY(10);
  
        prb = schp->sch_prb[slot];
***************
*** 490,496 ****
                    SIISATANAME(sc), chp->ch_channel);
                /* XXX and then ? */
        }
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
                DELAY(10);
        PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
            PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
--- 503,509 ----
                    SIISATANAME(sc), chp->ch_channel);
                /* XXX and then ? */
        }
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
                DELAY(10);
        PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
            PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
***************
*** 546,552 ****
                schp->sch_sstatus)) {
        case SStatus_DET_DEV:
                /* wait for ready */
!               while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS))
                    & PR_PS_PORT_READY))
                        DELAY(10);
  
--- 559,565 ----
                schp->sch_sstatus)) {
        case SStatus_DET_DEV:
                /* wait for ready */
!               while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS))
                    & PR_PS_PORT_READY))
                        DELAY(10);
  
***************
*** 781,792 ****
            ("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);
  
        pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
  
        if ((xfer->c_flags & C_TIMEOU) != 0)
                goto command_done;
  
        if (pis & PR_PIS_CMDCMPL) {
!               /* get slot status, clearing completion interrupt */
                pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
                /* is this expected? */
                if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 794,807 ----
            ("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);
  
        pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+       /* clear only those interrupts, we are gooing to server now ... */
+       if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
  
        if ((xfer->c_flags & C_TIMEOU) != 0)
                goto command_done;
  
        if (pis & PR_PIS_CMDCMPL) {
!               /* get slot status - interrupt cleared above - PR_PC_INCOR set 
... */
                pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
                /* is this expected? */
                if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1062,1070 ****
        uint32_t *prbfis = (void *)fis;
  
        pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
  
        if (pis & PR_PIS_CMDCMPL) {
!               /* get slot status, clearing completion interrupt */
                pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
                /* is this expected? */
                if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1077,1090 ----
        uint32_t *prbfis = (void *)fis;
  
        pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+       /* clear only those interrupts, we are gooing to server now ... */
+       if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
+ 
+       if ((xfer->c_flags & C_TIMEOU) != 0)
+               goto command_done;
  
        if (pis & PR_PIS_CMDCMPL) {
!               /* get slot status - interrupt cleared above - PR_PC_INCOR set 
... */
                pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
                /* is this expected? */
                if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1116,1122 ****
  command_done:
        schp->sch_active_slots &= ~__BIT(slot);
        chp->ch_flags &= ~ATACH_IRQ_WAIT;
!       callout_stop(&chp->ch_callout);
  
        chp->ch_queue->active_xfer = NULL;
  
--- 1136,1147 ----
  command_done:
        schp->sch_active_slots &= ~__BIT(slot);
        chp->ch_flags &= ~ATACH_IRQ_WAIT;
!       if (xfer->c_flags & C_TIMEOU) {
!               ata_bio->error = TIMEOUT;
!       } else {
!               callout_stop(&chp->ch_callout);
!               ata_bio->error = NOERROR;
!       }
  
        chp->ch_queue->active_xfer = NULL;
  
***************
*** 1270,1278 ****
  {
        struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
  
!       PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
!           PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_PORT_INITIALIZE);
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
                DELAY(10);
  }
  
--- 1295,1302 ----
  {
        struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
  
!       PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_PORT_INITIALIZE);
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
                DELAY(10);
  }
  
***************
*** 1281,1289 ****
  {
        struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
  
!       PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
!           PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_DEVICE_RESET);
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
                DELAY(10);
  }
  
--- 1305,1312 ----
  {
        struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
  
!       PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_DEVICE_RESET);
!       while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
                DELAY(10);
  }
  
***************
*** 1456,1464 ****
                        periph->periph_cap |= PERIPH_CAP_CMD16;
                
                        /* configure port for packet length */
!                       PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS),
!                           PRREAD(siic, PRX(chp->ch_channel, PRO_PCS)) |
!                           PR_PC_PACKET_LENGTH);
                }
                /* XXX This is gross. */
                periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
--- 1479,1485 ----
                        periph->periph_cap |= PERIPH_CAP_CMD16;
                
                        /* configure port for packet length */
!                       PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS), 
PR_PC_PACKET_LENGTH);
                }
                /* XXX This is gross. */
                periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
***************
*** 1649,1657 ****
        }
  
        pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
  
        if (pis & PR_PIS_CMDCMPL) {
!               /* get slot status, clearing completion interrupt */
                pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
                /* is this expected? */
                if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1670,1680 ----
        }
  
        pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+       /* clear only those interrupts, we are gooing to server now ... */
+       if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
  
        if (pis & PR_PIS_CMDCMPL) {
!               /* get slot status - interrupt cleared above - PR_PC_INCOR set 
... */
                pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
                /* is this expected? */
                if ((schp->sch_active_slots & __BIT(slot)) == 0) {

>Unformatted:
        
        


Home | Main Index | Thread Index | Old Index