NetBSD-Bugs archive


kern/41867: ahc-driver freezes after first device timeout and loses error information

>Number:         41867
>Category:       kern
>Synopsis:       ahc-driver freezes after first device timeout and loses error information
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Aug 10 13:40:00 +0000 2009
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 4.0
>Organization:
Dr. Nagler & Company GmbH
>Environment:
System: NetBSD s013 4.0 NetBSD 4.0 (NSW-S013-new) #37: Mon Aug 10 11:55:32 CEST 2009 wgstuken@s013:/usr/src/sys/arch/amd64/compile/NSW-S013-new amd64
Architecture: x86_64
Machine: amd64
>Description:
        After the first device timeout on the SCSI bus, no further commands
        are executed on that bus.
        We see this problem with tape drives (DAT and VXA) on several
        machines running different versions of NetBSD (3.x and 4.x).
        We use 19160, 29160 and 29160N controllers; they all share this
        problem.
        I traced the problem to a missing thaw call during device-reset
        processing.
        The channel is frozen in ahc_set_recoveryscb(), but never thawed again.
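
        The effect of the missing thaw can be illustrated with a small
        stand-alone model. This is only a sketch, assuming the channel keeps
        a simple freeze counter; the model_* names are hypothetical and are
        not the real scsipi or ahc interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a SCSI channel; not the real struct scsipi_channel. */
struct model_channel {
    int freeze_count;   /* > 0 means the queue is frozen */
    int executed;       /* number of commands that actually ran */
};

/* Each freeze bumps the counter. */
void model_freeze(struct model_channel *ch, int count)
{
    ch->freeze_count += count;
}

/* Each thaw must undo an earlier freeze of the same size. */
void model_thaw(struct model_channel *ch, int count)
{
    assert(ch->freeze_count >= count);
    ch->freeze_count -= count;
}

/* A command only runs while the channel is not frozen. */
bool model_run_command(struct model_channel *ch)
{
    if (ch->freeze_count > 0)
        return false;
    ch->executed++;
    return true;
}

/* Recovery path: freeze, reset the device, optionally thaw afterwards. */
void model_recover(struct model_channel *ch, bool thaw_after_reset)
{
    model_freeze(ch, 1);
    /* ... the bus device reset would be handled here ... */
    if (thaw_after_reset)
        model_thaw(ch, 1);
}
```

        In this model, a recovery without the thaw leaves freeze_count at 1,
        so every later model_run_command() call is refused; with the thaw,
        the next command runs normally.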

        The next problem in the ahc driver is that the failed request is
        returned to the caller without any error indication.
        I set an error indication in ahc_abort_scbs(); that looks like the
        correct place to me because, in my view, any aborted request should
        return an error.
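
        A minimal sketch of that intent: mark an aborted request as failed,
        but only if no error has been recorded for it yet. The types and
        MODEL_* error codes below are hypothetical stand-ins, not the
        driver's real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes; the real XS_* values live in the scsipi headers. */
enum model_error { MODEL_NOERROR = 0, MODEL_TIMEOUT, MODEL_RESET };

struct model_xfer { enum model_error error; };
struct model_scb  { struct model_xfer *xs; };

/*
 * Called for every request that is aborted. A request that still claims
 * success gets a retryable reset error; an error that is already set is
 * left untouched.
 */
void model_abort_scb(struct model_scb *scb)
{
    assert(scb != NULL);
    if (scb->xs != NULL && scb->xs->error == MODEL_NOERROR)
        scb->xs->error = MODEL_RESET;
}
```

        Keeping an already recorded error (for example a timeout) avoids
        hiding the more specific cause behind the generic reset status.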

        During the analysis I noticed another problem in the driver's
        device-reset processing.
        If an explicit device reset is requested from user level, the same
        scb is queued to the controller twice. The only driver that seems
        to be affected by this is the CD driver; I found no other location
        where the relevant flag is set.
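
        The double queueing comes down to a missing else. The following
        stand-alone sketch shows the pattern; the demo_* names are
        hypothetical, and it assumes that the data-setup path ends by
        queueing the scb, just as the reset path does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request; queued counts how often it reached the controller. */
struct demo_scb {
    bool reset_requested;
    int  queued;
};

/* Hands the scb to the controller. */
void demo_execute(struct demo_scb *scb)
{
    scb->queued++;
}

/* Prepares the data transfer, then queues the scb (assumed behaviour). */
void demo_setup_data(struct demo_scb *scb)
{
    demo_execute(scb);
}

/* Without the else, a reset request falls through and is queued twice. */
void demo_submit_buggy(struct demo_scb *scb)
{
    if (scb->reset_requested)
        demo_execute(scb);
    demo_setup_data(scb);
}

/* With the else, every request is queued exactly once. */
void demo_submit_fixed(struct demo_scb *scb)
{
    if (scb->reset_requested)
        demo_execute(scb);
    else
        demo_setup_data(scb);
    assert(scb->queued == 1);
}
```

        Requests without the reset flag behave the same in both variants,
        which fits the observation that only callers setting that flag are
        affected.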

        The following two patches fix the above problems when a device
        timeout occurs or a device reset is requested.

        Nevertheless, I haven't found the root cause of the timeout; I assume
        there is another bug somewhere in the ahc driver.
        I haven't found a way to trigger it: sometimes it takes minutes until
        it happens, sometimes days. I'm sure it is not related to the wiring
        of the tape devices on the SCSI bus ...

        With this patch, EIO is returned to the caller after a device timeout
        and it is no longer necessary to reboot the system.
>How-To-Repeat:
        Connect a tape drive to an ahc controller and write to it. Wait until
        a device timeout occurs ...
>Fix:
        The following two files in /usr/src/sys/dev/ic must be updated to fix
        the problem.

RCS file: RCS/aic7xxx.c,v
retrieving revision 1.1
diff -u -r1.1 aic7xxx.c
--- aic7xxx.c   2009/08/07 12:15:32     1.1
+++ aic7xxx.c   2009/08/07 12:27:51
@@ -1290,6 +1290,12 @@
                                                    "Bus Device Reset",
+                               /* reset freeze status - was set in ahc_set_recoveryscb() - otherwise we will hang ... */
+                               scsipi_channel_thaw(&ahc->sc_channel, 1);
+                               if (ahc->features & AHC_TWIN)
+                                 scsipi_channel_thaw(&ahc->sc_channel_b, 1);
                                printerror = 0;
                        } else if (ahc_sent_msg(ahc, AHCMSG_EXT,
                                                MSG_EXT_PPR, FALSE)) {
@@ -5880,6 +5886,11 @@
                        if ((scbp->flags & SCB_ACTIVE) == 0)
                                printf("Inactive SCB on pending list\n");
+                       /* set error status - otherwise these scbs will signal success to the initiator .... */
+                       if (scbp->xs != NULL && scbp->xs->error == XS_NOERROR)
+                         scbp->xs->error = XS_RESET; /* we use XS_RESET here - it may be a good idea to retry the command later */
                        ahc_done(ahc, scbp);
RCS file: RCS/aic7xxx_osm.c,v
retrieving revision 1.1
diff -u -r1.1 aic7xxx_osm.c
--- aic7xxx_osm.c       2009/08/10 13:15:03     1.1
+++ aic7xxx_osm.c       2009/08/10 13:16:28
@@ -323,8 +323,8 @@
                        hscb->control |= MK_MESSAGE;
                        ahc_execute_scb(scb, NULL, 0);
-               ahc_setup_data(ahc, xs, scb);
+               else /* do not use the scb a second time - it has been freed by the ahc_execute_scb processing above ... */
+                       ahc_setup_data(ahc, xs, scb);

