netbsd-bugs: Re: ncr53c810 pci scsi driver hangs system frequently

Subject: Re: ncr53c810 pci scsi driver hangs system frequently
To: James E. Bernard <jbernard@geek.mines.edu>
From: Stefan Esser <se@zpr.uni-koeln.de>
List: netbsd-bugs
Date: 03/15/1996 22:57:38

On Mar 15, 14:25, James E. Bernard wrote:
} System: NetBSD zoo 1.1 NetBSD 1.1 (ZOO) #0: Sun Dec 31 21:06:09 MST 1995 local@zoo:/home/local/netbsd-1.1/usr/src/sys/arch/i386/compile/ZOO i386
} The cpu is a 100 MHz Pentium.
} SCSI devices include: Quantum Atlas XP32150 and Toshiba XM-3601 CD-ROM drive:
} /netbsd: ncr0 targ 0 lun 0: <Quantum, XP32150, 81HB> SCSI2 0/direct fixed
} /netbsd: sd0 at scsibus0sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
} /netbsd: : 2050MB, 3907 cyl, 10 head, 107 sec, 512 bytes/sec
} /netbsd: ncr0 targ 2 lun 0: <TOSHIBA, CD-ROM XM-3601TA, 0725> SCSI2 5/cdrom removable
} /netbsd: cd0 at scsibus0cd0(ncr0:2:0): asynchronous.
} Root, swap, /usr, and /var are on the Atlas.  User files are nfs mounted from
} another machine.  An ide disk is present and mounted, but not normally
} accessed.
} 
} 
} >Description:
} 	From time to time (varying from a minimum of about 1/2 hour to a maximum
} 	of about 1 week) the scsi driver goes into a loop, with the following
} 	error messages printed on the console:
} 
} 	  assertion "cp" failed: file "../../../../dev/pci/ncr.c", line 5577
} 	  sd0(ncr0:0:0): COMMAND FAILED (4 28) @f87d2800.
} 
} 	repeatedly (it does this forever).  In this state, the disk (controller)
} 	activity light is on continuously, and no action involving the disk can
} 	be taken until the system is restarted.  Note that it is not clear which
} 	of the messages above comes first, since I've never been present and
} 	watching the console (i.e., not running X) when the problem starts.
} 	Also, it is difficult to read the messages, since they overwrite each
} 	other so fast, but I think they are correct.  (Nothing, of course, is
} 	written to the console log on disk.)

I'm using the same drive in my system, and have
never observed that behaviour. The cause of the
command failure seems to be a QUEUE FULL condition,
as you write below.

Since only 4 tags are generally used, this is an
"impossible" situation. The drive supports more
than ten times as many simultaneous commands ...

Could you try the latest driver version, as found
in a FreeBSD-current source tree?

You may want to disable tags as a workaround.
This can be done by applying the following patch
to /sys/pci/ncr.c (assuming it is found at the
same place under NetBSD and FreeBSD):

Index: /sys/pci/ncr.c
===================================================================
RCS file: /usr/cvs/src/sys/pci/ncr.c,v
retrieving revision 1.61
diff -C2 -r1.61 ncr.c
*** ncr.c	1996/01/23 21:47:12	1.61
--- ncr.c	1996/01/30 09:17:03
***************
*** 6222,6226 ****
  		np->jump_tcb.l_paddr = vtophys (&tp->jump_tcb);
  
! 		ncr_setmaxtags (tp, SCSI_NCR_MAX_TAGS);
  	}
  
--- 6222,6226 ----
  		np->jump_tcb.l_paddr = vtophys (&tp->jump_tcb);
  
! 		ncr_setmaxtags (tp, 0 /*SCSI_NCR_MAX_TAGS*/);
  	}

} 	I have not been able to associate this problem with any particular
} 	system activity.  Indeed, it has never happened while I was at the
} 	machine, and the machine is usually fairly inactive (except for uucp
} 	and pop mail transfers) when I am away from it.  I have copied as much
} 	as 1.5 GB of data to the disk in a fairly short time, with no problems,
} 	so disk activity does not seem to cause it.
} 
} 	Perusal of the ncr code turned up the following: the two numbers (4 28)
} 	refer to host status and scsi status, respectively, the former represented
} 	in the code by the cpp symbol HS_COMPLETE, and the latter by S_QUEUE_FULL.
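
To make the decoding above concrete, here is a minimal
sketch in C. It assumes that the two numbers in the
"COMMAND FAILED (4 28)" message are printed in hexadecimal
(which fits 28 being the SCSI-2 QUEUE FULL status byte,
0x28) and takes HS_COMPLETE = 4 from the report. The
function names are hypothetical and are not taken from
ncr.c.

```c
#include <assert.h>
#include <stdio.h>

/* HS_COMPLETE = 4 is taken from the report above; the status bytes
 * are the SCSI-2 values. Only a few status codes are listed. */
enum { HS_COMPLETE = 4 };
enum {
    S_GOOD       = 0x00,
    S_CHECK_COND = 0x02,
    S_BUSY       = 0x08,
    S_QUEUE_FULL = 0x28
};

/* Hypothetical helper: parse the "(4 28)" part of the console message.
 * Assumes both fields are hexadecimal. Returns 1 on success, 0 otherwise. */
int parse_status_pair(const char *text, unsigned *host, unsigned *scsi)
{
    assert(text != NULL && host != NULL && scsi != NULL);
    return sscanf(text, " (%x %x)", host, scsi) == 2;
}

/* Hypothetical helper: map a SCSI status byte to a readable name. */
const char *scsi_status_name(unsigned scsi)
{
    switch (scsi) {
    case S_GOOD:       return "GOOD";
    case S_CHECK_COND: return "CHECK CONDITION";
    case S_BUSY:       return "BUSY";
    case S_QUEUE_FULL: return "QUEUE FULL";
    default:           return "unknown";
    }
}

/* True if the command completed at the host level but the target
 * rejected it because its command queue was full. */
int is_queue_full(unsigned host, unsigned scsi)
{
    return host == HS_COMPLETE && scsi == S_QUEUE_FULL;
}
```

With the pair from the console message, parse_status_pair()
yields host status 4 and SCSI status 0x28, and
is_queue_full() reports true.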

I'm not sure what may trigger this situation.
It appears as if there was a "tag leak", but
the driver limits the number of tags used
simultaneously, and I can't think of a way to
circumvent that program logic ...
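
The kind of limit meant here can be sketched as a simple
counter. This is only an illustration under assumptions:
the names tag_pool, tag_acquire and tag_release are
hypothetical, and the real accounting in ncr.c is more
involved. The limit of 4 is the number of tags mentioned
above.

```c
#include <assert.h>

#define MAX_TAGS 4  /* number of tags generally used, as stated above */

/* Hypothetical per-target bookkeeping of outstanding tagged commands. */
struct tag_pool {
    unsigned in_use;  /* commands currently holding a tag */
    unsigned max;     /* upper limit on simultaneous tags */
};

/* Try to reserve a tag. Returns 1 on success, 0 if the limit is reached;
 * in that case the command has to wait instead of being sent. */
int tag_acquire(struct tag_pool *p)
{
    if (p->in_use >= p->max)
        return 0;
    p->in_use++;
    return 1;
}

/* Give a tag back when its command has completed. */
void tag_release(struct tag_pool *p)
{
    assert(p->in_use > 0);
    p->in_use--;
}
```

As long as every completed command releases its tag, the
drive never sees more than MAX_TAGS commands at once. A
QUEUE FULL status would therefore require the count kept
by the driver and the count seen by the drive to disagree.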

There are several timeout mechanisms that had
been put into the driver to improve stability.
But it has been found that they can lead to
unexpected behaviour if they are actually
triggered. For this reason, most of them are
disabled or will soon be disabled ...

Somebody reported that the drive may choose to 
delay a write command by more than 5 seconds if
there are concurrent read requests. This may 
lead to a command timeout (the generic SCSI code
generally requests a timeout of 10 seconds for
a read, and I never heard of a case where a write
was delayed THAT long, but it might be the cause
of your problem ...).

Please let me know whether the system works
reliably with tags disabled and, if you can
build a kernel with the most recent driver
code, whether this improves things ...

Regards, STefan

PS: Please CC: all replies to my address, since 
    I'm not on the NetBSD-Bugs list ...

-- 
 Stefan Esser, Zentrum fuer Paralleles Rechnen		Tel:	+49 221 4706021
 Universitaet zu Koeln, Weyertal 80, 50931 Koeln	FAX:	+49 221 4705160
 ==============================================================================
 http://www.zpr.uni-koeln.de/~se			  <se@ZPR.Uni-Koeln.DE>