NetBSD-Bugs archive


Re: port-i386/41706: disk subsystem unresponsive after (recovered) disk failure



On Tue, Jul 28, 2009 at 09:58:59PM +0200, Manuel Bouyer wrote:
> On Sun, Jul 12, 2009 at 03:05:00PM +0000, bad%bsd.de@localhost wrote:
> > >Description:
> >     
> > sd1 failed on the above system a couple of days ago.  What I could see
> > on the console were the messages from ahc1 being reset.  sd1 became
> > unready and would no longer respond positively to a TEST UNIT READY command
> > (firmware diagnostic failure given as the reason).
> > 
> > The system sat there for 2 more days without further kernel messages.
> > Pressing return on the console would produce a new login prompt from getty.
> > The system was pingable and did accept TCP connections (e.g. to the SSH 
> > port).
> > But no disk IO would happen and no error messages were printed.
> > IOW, the block IO subsystem seems to have been deadlocked at a high level.
> 
> This is an issue with timeouts in the ahc driver (I found with a tape drive
> where some mt or chio operation would take too long). I have a patch for this
> (on a powered-down system, I'll have a look tomorrow).
> From memory, the workaround was to not send a BDR message and to do a
> bus reset directly.

Attached is the patch I used. I also fixed the value of
CAM_CMD_TIMEOUT so it doesn't match XS_TIMEOUT only by accident.
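
To illustrate the "only by accident" point: in C, an enumerator without an
initializer takes the previous enumerator's value plus one. The sketch below
shows the effect with made-up DEMO_* names and made-up numeric values; they
are stand-ins for illustration and are not copied from the scsipi headers.

```c
#include <assert.h>

/* Hypothetical stand-ins for two XS_* error codes; values are illustrative. */
enum demo_xs {
    DEMO_XS_SELTIMEOUT = 5,
    DEMO_XS_TIMEOUT    = 6
};

/* Before: the command-timeout code has no initializer, so it is simply
 * "previous enumerator + 1". It equals DEMO_XS_TIMEOUT only by position. */
enum demo_cam_implicit {
    DEMO_OLD_SEL_TIMEOUT = DEMO_XS_SELTIMEOUT,
    DEMO_OLD_CMD_TIMEOUT
};

/* After: the value is tied to the XS code explicitly, so it stays correct
 * even if the two lists are later reordered or renumbered. */
enum demo_cam_explicit {
    DEMO_NEW_SEL_TIMEOUT = DEMO_XS_SELTIMEOUT,
    DEMO_NEW_CMD_TIMEOUT = DEMO_XS_TIMEOUT
};

/* Small self-check of both mappings. */
static void demo_check_mapping(void)
{
    assert(DEMO_OLD_CMD_TIMEOUT == DEMO_XS_SELTIMEOUT + 1);
    assert(DEMO_NEW_CMD_TIMEOUT == DEMO_XS_TIMEOUT);
}
```

With the implicit form, inserting a new enumerator before the timeout entry
would silently shift its value; the explicit form does not have that problem.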

-- 
Manuel Bouyer, LIP6, Universite Paris VI.           
Manuel.Bouyer%lip6.fr@localhost
     NetBSD: 26 years of experience will always make the difference
--
Index: aic7xxx_cam.h
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/aic7xxx_cam.h,v
retrieving revision 1.4
diff -u -p -u -r1.4 aic7xxx_cam.h
--- aic7xxx_cam.h       14 Mar 2006 15:24:30 -0000      1.4
+++ aic7xxx_cam.h       29 Jul 2009 16:27:42 -0000
@@ -71,7 +71,7 @@ typedef enum {
        CAM_REQ_INVALID = XS_DRIVER_STUFFUP,    /* CCB request was invalid */
        CAM_PATH_INVALID,                       /* Supplied Path ID is invalid */
        CAM_SEL_TIMEOUT = XS_SELTIMEOUT,        /* Target Selection Timeout */
-       CAM_CMD_TIMEOUT,                        /* Command timeout */
+       CAM_CMD_TIMEOUT = XS_TIMEOUT,           /* Command timeout */
        CAM_SCSI_STATUS_ERROR,                  /* SCSI error, look at error code in CCB */
        CAM_SCSI_BUS_RESET = XS_RESET,          /* SCSI Bus Reset Sent/Received */
        CAM_UNCOR_PARITY = XS_DRIVER_STUFFUP,   /* Uncorrectable parity error occurred */
Index: aic7xxx_osm.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ic/aic7xxx_osm.c,v
retrieving revision 1.27
diff -u -p -u -r1.27 aic7xxx_osm.c
--- aic7xxx_osm.c       8 Apr 2008 12:07:25 -0000       1.27
+++ aic7xxx_osm.c       29 Jul 2009 16:27:42 -0000
@@ -787,7 +787,7 @@ ahc_timeout(void *arg)
                               scb->sg_list[i].len & AHC_SG_LEN_MASK);
                }
        }
-       if (scb->flags & (SCB_DEVICE_RESET|SCB_ABORT)) {
+       if (1 /* scb->flags & (SCB_DEVICE_RESET|SCB_ABORT) */) {
                /*
                 * Been down this road before.
                 * Do a full bus reset.
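
The effect of the aic7xxx_osm.c hunk can be modelled roughly as follows. This
is only a sketch: the DEMO_* names and demo_pick_recovery() are hypothetical,
and the "send a BDR first" branch is inferred from the description in the
mail, since the hunk above does not show the else branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for SCB_DEVICE_RESET and SCB_ABORT. */
enum demo_scb_flags {
    DEMO_SCB_DEVICE_RESET = 1 << 0,
    DEMO_SCB_ABORT        = 1 << 1
};

enum demo_recovery {
    DEMO_SEND_BDR,  /* first attempt: reset only the timed-out target */
    DEMO_BUS_RESET  /* escalation: reset the whole SCSI bus */
};

/*
 * Choose a recovery action for a timed-out command.
 * always_bus_reset == false models the original condition,
 * always_bus_reset == true models the patched "if (1 ...)" condition.
 */
static enum demo_recovery
demo_pick_recovery(unsigned flags, bool always_bus_reset)
{
    if (always_bus_reset ||
        (flags & (DEMO_SCB_DEVICE_RESET | DEMO_SCB_ABORT)) != 0)
        return DEMO_BUS_RESET;
    return DEMO_SEND_BDR;
}

/* Small self-check: a first timeout differs between the two variants. */
static void demo_check_recovery(void)
{
    assert(demo_pick_recovery(0, false) == DEMO_SEND_BDR);
    assert(demo_pick_recovery(0, true) == DEMO_BUS_RESET);
}
```

In this model the only behavioural difference is on the first timeout of a
command: the original condition tries a device reset first, while the patched
one goes straight to a bus reset.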

