NetBSD-Bugs archive


Re: port-i386/41706: disk subsystem unresponsive after (recovered) disk failure



The following reply was made to PR port-i386/41706; it has been noted by GNATS.

From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: port-i386-maintainer%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost,
        netbsd-bugs%NetBSD.org@localhost
Subject: Re: port-i386/41706: disk subsystem unresponsive after (recovered)
        disk failure
Date: Wed, 29 Jul 2009 18:43:27 +0200

 --AhhlLboLdkugWU4S
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline
 
 On Tue, Jul 28, 2009 at 09:58:59PM +0200, Manuel Bouyer wrote:
 > On Sun, Jul 12, 2009 at 03:05:00PM +0000, bad%bsd.de@localhost wrote:
 > > >Description:
 > >    
 > > sd1 failed on the above system a couple of days ago.  What I could see
 > > on the console were the messages from ahc1 being reset.  sd1 became
 > > unready and would no longer respond positively to a TEST UNIT READY command
 > > (firmware diagnostic failure given as the reason).
 > > 
 > > The system sat there for 2 more days without further kernel messages.
 > > Pressing return on the console would produce a new login prompt from getty.
 > > The system was pingable and did accept TCP connections (e.g. to the SSH 
 > > port).
 > > But no disk IO would happen and no error messages were printed.
 > > IOW, the block IO subsystem seems to have been deadlocked at a high level.
 > 
 > This is an issue with timeouts in the ahc driver (I found it with a tape drive
 > where some mt or chio operation would take too long). I have a patch for this
 > (on a powered down system, I'll have a look tomorrow).
 > From memory, the workaround was to not send a BDR message and directly do a
 > bus reset.
 
 Attached is the patch I used. I also fixed the value of
 CAM_CMD_TIMEOUT so it doesn't match XS_TIMEOUT only by accident.
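
The "only by accident" remark can be illustrated with a small sketch. In C, an enumerator without an initializer takes the value of the previous enumerator plus one, so an implicit CAM_CMD_TIMEOUT equals XS_TIMEOUT only while XS_TIMEOUT happens to be XS_SELTIMEOUT + 1. The DEMO_* macros and the IMPL_*/EXPL_* enumerators below are hypothetical stand-ins, and their numeric values are assumed for illustration rather than copied from the NetBSD headers.

```c
#include <assert.h>

/* Hypothetical stand-ins for the scsipi XS_* error codes. The numeric
 * values are assumptions made for this sketch. */
#define DEMO_XS_DRIVER_STUFFUP 3
#define DEMO_XS_SELTIMEOUT     5
#define DEMO_XS_TIMEOUT        6

/* Shape of the enum before the patch: the command-timeout enumerator has
 * no initializer, so it silently becomes "previous enumerator + 1". */
enum demo_status_implicit {
	IMPL_REQ_INVALID = DEMO_XS_DRIVER_STUFFUP,
	IMPL_PATH_INVALID,                      /* previous + 1 */
	IMPL_SEL_TIMEOUT = DEMO_XS_SELTIMEOUT,
	IMPL_CMD_TIMEOUT                        /* DEMO_XS_SELTIMEOUT + 1 */
};

/* Shape of the enum after the patch: the mapping is stated explicitly and
 * survives any renumbering of the XS_* codes. */
enum demo_status_explicit {
	EXPL_REQ_INVALID = DEMO_XS_DRIVER_STUFFUP,
	EXPL_PATH_INVALID,
	EXPL_SEL_TIMEOUT = DEMO_XS_SELTIMEOUT,
	EXPL_CMD_TIMEOUT = DEMO_XS_TIMEOUT
};

/* The implicit form is tied to its neighbour, not to DEMO_XS_TIMEOUT. */
static_assert(IMPL_CMD_TIMEOUT == DEMO_XS_SELTIMEOUT + 1,
    "implicit enumerator follows the previous one");

/* The explicit form holds by construction. */
static_assert(EXPL_CMD_TIMEOUT == DEMO_XS_TIMEOUT,
    "explicit enumerator tracks the timeout code");
```

With the assumed values both forms yield 6. If DEMO_XS_TIMEOUT were renumbered, only the explicit form would follow it.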
 
 -- 
 Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyer%lip6.fr@localhost
      NetBSD: 26 ans d'experience feront toujours la difference
 --
 
 --AhhlLboLdkugWU4S
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename="aic_rst.diff"
 
 Index: aic7xxx_cam.h
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/aic7xxx_cam.h,v
 retrieving revision 1.4
 diff -u -p -u -r1.4 aic7xxx_cam.h
 --- aic7xxx_cam.h      14 Mar 2006 15:24:30 -0000      1.4
 +++ aic7xxx_cam.h      29 Jul 2009 16:27:42 -0000
 @@ -71,7 +71,7 @@ typedef enum {
        CAM_REQ_INVALID = XS_DRIVER_STUFFUP,    /* CCB request was invalid */
        CAM_PATH_INVALID,                       /* Supplied Path ID is invalid */
        CAM_SEL_TIMEOUT = XS_SELTIMEOUT,        /* Target Selection Timeout */
 -      CAM_CMD_TIMEOUT,                        /* Command timeout */
 +      CAM_CMD_TIMEOUT = XS_TIMEOUT,           /* Command timeout */
        CAM_SCSI_STATUS_ERROR,                  /* SCSI error, look at error code in CCB */
        CAM_SCSI_BUS_RESET = XS_RESET,          /* SCSI Bus Reset Sent/Received */
        CAM_UNCOR_PARITY = XS_DRIVER_STUFFUP,   /* Uncorrectable parity error occurred */
 Index: aic7xxx_osm.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/ic/aic7xxx_osm.c,v
 retrieving revision 1.27
 diff -u -p -u -r1.27 aic7xxx_osm.c
 --- aic7xxx_osm.c      8 Apr 2008 12:07:25 -0000       1.27
 +++ aic7xxx_osm.c      29 Jul 2009 16:27:42 -0000
 @@ -787,7 +787,7 @@ ahc_timeout(void *arg)
                               scb->sg_list[i].len & AHC_SG_LEN_MASK);
                }
        }
 -      if (scb->flags & (SCB_DEVICE_RESET|SCB_ABORT)) {
 +      if (1 /* scb->flags & (SCB_DEVICE_RESET|SCB_ABORT) */) {
                /*
                 * Been down this road before.
                 * Do a full bus reset.
 
 --AhhlLboLdkugWU4S--
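
The behavioural change in the aic7xxx_osm.c hunk can be sketched as a small decision function. This is a simplified model, not the driver code: it assumes, following the description in the thread, that the branch not taken by the original condition is the one that sends a BDR message. The DEMO_* flag values and both function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical flag bits named after the driver's SCB_DEVICE_RESET and
 * SCB_ABORT flags. The numeric values are assumptions for this sketch. */
#define DEMO_SCB_DEVICE_RESET 0x01u
#define DEMO_SCB_ABORT        0x02u

static_assert((DEMO_SCB_DEVICE_RESET & DEMO_SCB_ABORT) == 0,
    "flag bits must not overlap");

/* Recovery actions a timeout handler can choose between in this model. */
enum demo_recovery {
	DEMO_SEND_BDR,   /* send a Bus Device Reset message to the target */
	DEMO_BUS_RESET   /* reset the whole SCSI bus */
};

/* Model of the original condition: escalate to a bus reset only when a
 * device reset or abort was already attempted for this command. */
enum demo_recovery
demo_recovery_original(unsigned int scb_flags)
{
	if (scb_flags & (DEMO_SCB_DEVICE_RESET | DEMO_SCB_ABORT))
		return DEMO_BUS_RESET;
	return DEMO_SEND_BDR;
}

/* Model of the patched condition, "if (1)": every timeout goes straight
 * to a bus reset and the BDR step is never tried. */
enum demo_recovery
demo_recovery_patched(unsigned int scb_flags)
{
	(void)scb_flags;   /* the flags no longer influence the decision */
	return DEMO_BUS_RESET;
}
```

In this model a first timeout (no flags set) yields DEMO_SEND_BDR under the original condition and DEMO_BUS_RESET under the patched one. That difference is what lets a target that never answers the BDR be recovered by resetting the bus instead.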
 

