netbsd-bugs: Re: kern/21335 ahc -current 20030427 driver leaves process in D state after timeout/BDR

Subject: Re: kern/21335 ahc -current 20030427 driver leaves process in D state after timeout/BDR
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Reinoud Zandijk <reinoud@NetBSD.org>
List: netbsd-bugs
Date: 09/14/2006 02:30:03

The following reply was made to PR kern/21335; it has been noted by GNATS.

From: Reinoud Zandijk <reinoud@NetBSD.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/21335 ahc -current 20030427 driver leaves process in D state after timeout/BDR
Date: Thu, 14 Sep 2006 04:29:06 +0200

 Dear folks,

 Picking up this PR for a quick look after cries from Tracy, it seems that 
 the problem is not in src/sys/dev/scsipi/st.c as I first thought, but in the 
 interaction between the scsipi framework and the ahc driver.

 The ahc driver contains a bug that puts 
 src/sys/dev/scsipi/scsipi_base.c:scsipi_execute_xs() into an endless loop, 
 since the failed command never gets (xs->xs_status & XS_STS_DONE) set. 
 This leaves the close() call waiting forever.
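
 To illustrate the failure mode, here is a minimal user-space sketch of a 
 wait-for-done loop. It is NOT the actual scsipi code: the names 
 sketch_xfer, SKETCH_STS_DONE, sketch_adapter_tick() and 
 sketch_poll_for_done() are made up for illustration, and the max_polls 
 bound exists only so that the sketch terminates. The stuck close() 
 corresponds to the same loop without such a bound.

```c
#include <stdbool.h>

/* Hypothetical stand-in for XS_STS_DONE; not the real scsipi value. */
#define SKETCH_STS_DONE 0x01

/* Hypothetical, heavily simplified model of a transfer (cf. xs->xs_status). */
struct sketch_xfer {
    int status;             /* status flags; DONE is set on completion */
    int ticks_to_complete;  /* < 0 models a command the adapter has "lost" */
};

/* Models one round of adapter-side processing (interrupt or timeout). */
void sketch_adapter_tick(struct sketch_xfer *xs)
{
    if (xs->ticks_to_complete < 0)
        return;                         /* lost: nobody ever sets DONE */
    if (xs->ticks_to_complete == 0)
        xs->status |= SKETCH_STS_DONE;  /* normal completion path */
    else
        xs->ticks_to_complete--;
}

/*
 * Waits for the DONE flag. Returns true once the transfer is done, or
 * false after max_polls rounds (this bound is an addition of the sketch).
 */
bool sketch_poll_for_done(struct sketch_xfer *xs, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (xs->status & SKETCH_STS_DONE)
            return true;
        sketch_adapter_tick(xs);
    }
    return (xs->status & SKETCH_STS_DONE) != 0;
}
```

 With a transfer initialised as { 0, 3 } the wait returns true after a few 
 rounds; with { 0, -1 } (the `lost' case) it can only end by hitting the 
 artificial bound.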

 In the ahc driver, src/sys/dev/ic/aic7xxx_osm.c:ahc_timeout() uses `magic' 
 to insert a bus reset (notification?) command at line 978, placing it in 
 front of the offending command. I think though that this code will only 
 work when the queue is completely full; it's not clear enough.

 I think that in this piece of code the offending timed-out SCSI call 
 somehow gets `lost', and since we're polling on it in scsipi_execute_xs() 
 the result is an endless loop.

 The solution I'd see for the ahc driver is to pull over the changes made 
 by OpenBSD and FreeBSD to the driver's timeout code, which tackles the 
 problem in a different way by using a QUEUE structure.

 Note that the 0x0E SCBs found in this PR are `vendor specific' (function 
 unknown) and Tracy's 0x0F SCB is `READ REVERSE'(6?), which is specified as 
 *optional*. It might be that the software is getting signalled that there 
 is a write error, then wants to read a bit back to see what went wrong/get 
 a token. If the drive is shabby it might not understand the command and 
 thus fail, but I doubt that is the case; it is a possibility though.
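
 On the `(6?)': the conventional CDB length follows from the opcode's group 
 code (its top three bits), and both 0x0E and 0x0F fall in group 0, i.e. 
 6-byte CDBs. A small sketch of that mapping follows; the function name 
 sketch_cdb_length() is made up here, and returning 0 for the reserved and 
 vendor-specific groups is a choice of this sketch.

```c
#include <stdint.h>

/*
 * Conventional CDB length by SCSI opcode group code (opcode bits 7..5).
 * Returns 0 for group 3 (reserved) and groups 6/7 (vendor specific),
 * where the length cannot be derived from the opcode alone.
 */
int sketch_cdb_length(uint8_t opcode)
{
    switch (opcode >> 5) {
    case 0:
        return 6;       /* e.g. 0x0F READ REVERSE */
    case 1:
    case 2:
        return 10;
    case 4:
        return 16;
    case 5:
        return 12;
    default:
        return 0;       /* reserved or vendor-specific group */
    }
}
```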

 Regards,
 Reinoud