port-sparc: Re: SCSI probs on spork 10

Subject: Re: SCSI probs on spork 10
To: Jim Bernard <jbernard@mines.edu>
From: john heasley <heas@shrubbery.net>
List: port-sparc
Date: 06/20/2001 08:09:35

Tue, Jun 19, 2001 at 08:51:14PM -0600, Jim Bernard:
> On Tue, Jun 19, 2001 at 04:53:20PM -0500, NetBSD list wrote:
> > cvs versions beyond 1.5V of the NetBSD kernel hang on my Sparc 10.
> > 
> > The last working kernel I have is:
> > NetBSD dudley 1.5V NetBSD 1.5V (MRYSPARC) #3: Tue May 15 14:58:39 CDT 2001     root@dudley:/usr/src/sys/arch/sparc/compile/MRYSPARC sparc
> > 
> > Since sometime after that date, disk activity will hang (I'm guessing)
> > the SCSI bus.  A typical way I can cause this to happen is by cvs'ing
> > the source tree.  In about 5 seconds, the machine stops all disk access.
> > No errors are produced.  Sessions continue to echo characters, but no
> > commands are executed.  It's just off in la-la land (much as I often
> > aspire to be).
> > 
> > Anyone else having such an issue?  I've cleared out /usr/src many
> > times in the last month since then and started over, but the problem
> > persists here.  The kernel I'm running is GENERIC modified for 128 users.
> > 
> > Please help guide me in tracking this down if possible.  Meanwhile,
> > the 1.5V is working fine with the latest cvs userland.  Help with
> > the kernel debugger would be nice, as I'm not familiar with debugging
> > NetBSD kernels.
> 
>   Same here, also on a sparc 20.  With recent kernels I occasionally see
> scsi parity errors on one disk (can't seem to find a real hardware fault,
> though), and eventually the system just hangs.  Some things continue to
> work for a while after others stop working (logins via ssh seem to be one
> of the first to go, while sendmail continues to work much longer).  I also
> found that if I happened to be logged in while it was in this state,
> it would eventually just not execute commands, and after a bit more time
> even window focus changes would stop working.  The last working kernel
> I have is 1.5U from mid April.  The first one on which I observed the
> failure was 1.5V from May 29.  Unfortunately, that's all the info I
> have on the problem so far.
> 
>   BTW: I noticed that the most recent working kernel shows tagged queuing
> rejected on all the disks, e.g.:
> 
> sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST34555N, 0930> SCSI2 0/direct fixed
> sd0(esp0:0:0): max sync rate 10.00MB/s
> esp0: tagged queuing rejected: target 0
> 
> whereas the problematic kernel I built June 16 shows it enabled:
> 
> sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST34555N, 0930> SCSI2 0/direct fixed
> sd0: 4340 MB, 6300 cyl, 8 head, 176 sec, 512 bytes/sect x 8888924 sectors
> sd0: sync (100.0ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
> 
> I don't know whether this is related to the problem.
> 
> --Jim

this doesn't look quite the same as the parity errors i had with an
ibm scsi 3 drive on a FAS interface (SUNW,fas or sparc64 built-in), though.
i fixed those by making sure WIDE negotiation occurs before SYNC negotiation
(i don't think my change is complete, but i have not had time recently),
as follows:

Index: ncr53c9x.c
===================================================================
RCS file: /cvsroot/syssrc/sys/dev/ic/ncr53c9x.c,v
retrieving revision 1.80
diff -c -r1.80 ncr53c9x.c
*** ncr53c9x.c	2001/05/23 18:32:26	1.80
--- ncr53c9x.c	2001/06/20 15:06:35
***************
*** 510,516 ****
   * NCR_INTR - so make sure it is the last read.
   *
   * I think that (from reading the docs) most bits in these registers
!  * only make sense when he DMA CSR has an interrupt showing. Call only
   * if an interrupt is pending.
   */
  __inline__ void
--- 515,521 ----
   * NCR_INTR - so make sure it is the last read.
   *
   * I think that (from reading the docs) most bits in these registers
!  * only make sense when the DMA CSR has an interrupt showing. Call only
   * if an interrupt is pending.
   */
  __inline__ void
***************
*** 1793,1805 ****
  				break;
  
  			case MSG_EXT_WDTR:
! 				printf("%s: wide mode %d\n",
! 				       sc->sc_dev.dv_xname, sc->sc_imess[3]);
! 				if (sc->sc_imess[3] == 1) {
! 					ti->cfg3 |= NCRFASCFG3_EWIDE;
  					ncr53c9x_setsync(sc, ti);
! 				} else
! 					ti->width = 0;
  				ti->flags &= ~T_WIDE;
  				break;
  			default:
--- 1798,1817 ----
  				break;
  
  			case MSG_EXT_WDTR:
! 				printf("%s: %d bit mode\n",
! 					sc->sc_dev.dv_xname,
! 					sc->sc_imess[3] == 1 ? 16 : 
! 					sc->sc_imess[3] == 2 ? 32 : 8);
! 				printf("%s: ti->flags & T_WIDE = %d, ti->width = %d\n",
! 					sc->sc_dev.dv_xname,
! 					ti->flags & T_WIDE, ti->width);
! 				if (ti->flags & T_WIDE) {
! 					ti->width = sc->sc_imess[3];
! 					if (sc->sc_imess[3] != 0)
! 						ti->cfg3 |= NCRFASCFG3_EWIDE;
  					ncr53c9x_setsync(sc, ti);
! 					ncr53c9x_sched_msgout(SEND_WDTR);
! 				}
  				ti->flags &= ~T_WIDE;
  				break;
  			default:
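
for reference, here is a minimal standalone sketch (not driver code) of what
the new printf in the MSG_EXT_WDTR case is decoding. a WDTR extended message
is four bytes (0x01, 0x02, 0x03, width exponent), and sc_imess[3] is that
exponent: 0 means 8 bit, 1 means 16 bit, 2 means 32 bit. the function names
wdtr_width_bits and wdtr_decode are made up for illustration, and the
constants are redefined locally so the snippet compiles on its own. unlike
the printf in the patch, which reports any other exponent as 8 bit, this
sketch treats other values as invalid.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of the SCSI-2 message values, so this compiles standalone. */
#define MSG_EXTENDED		0x01	/* first byte of any extended message */
#define MSG_EXT_WDTR_LEN	0x02	/* length byte of a WDTR message */
#define MSG_EXT_WDTR		0x03	/* wide data transfer request code */

/*
 * Map a WDTR transfer width exponent to a bus width in bits.
 * Returns 0 for exponents other than 0, 1, or 2.
 * (Hypothetical helper, not part of ncr53c9x.c.)
 */
static int
wdtr_width_bits(uint8_t exponent)
{
	switch (exponent) {
	case 0:
		return 8;
	case 1:
		return 16;
	case 2:
		return 32;
	default:
		return 0;
	}
}

/*
 * Decode a complete WDTR message as it would sit in an incoming
 * message buffer: msg[0] = 0x01, msg[1] = 0x02, msg[2] = 0x03,
 * msg[3] = width exponent.
 * Returns the width in bits, or -1 if the bytes are not a valid WDTR.
 * (Hypothetical helper, not part of ncr53c9x.c.)
 */
static int
wdtr_decode(const uint8_t *msg, size_t len)
{
	int bits;

	if (len < 4 || msg[0] != MSG_EXTENDED ||
	    msg[1] != MSG_EXT_WDTR_LEN || msg[2] != MSG_EXT_WDTR)
		return -1;

	bits = wdtr_width_bits(msg[3]);
	return bits != 0 ? bits : -1;
}
```

calling wdtr_decode on the bytes 0x01 0x02 0x03 0x01 gives 16, which is the
case the old code tested for with sc->sc_imess[3] == 1.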