Subject: kern/13817: scsipi tagged queueing versus an ST15150N
To: None <gnats-bugs@gnats.netbsd.org>
From: None <kwellsch@tampabay.rr.com>
List: netbsd-bugs
Date: 08/29/2001 09:06:08
>Number:         13817
>Category:       kern
>Synopsis:       scsipi offers tagged queueing - an ST15150N accepts but breaks
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    kern-bug-people
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Wed Aug 29 06:02:00 PDT 2001
>Closed-Date:
>Last-Modified:
>Originator:     Ken Wellsch
>Release:        NetBSD-1.5X (-current)
>Organization:
>Environment:
Various NetBSD architectures
System: NetBSD arundel.fortyfour.org 1.5X NetBSD 1.5X (ARUNDEL) #0: Mon Aug 27 14:07:17 EDT 2001 kwellsch@arundel.fortyfour.org:/fsys/src/sys/arch/i386/compile/ARUNDEL i386
Architecture: i386
Machine: i386
>Description:

I recently purchased some used 4GB Seagate ST15150N disks and
ran into problems on most systems.  NetBSD, like Solaris, quickly
ends up hung - Solaris generates timeouts while -current reports
"device busy" or a similar error.

Under FreeBSD, the driver is able to recover after a whole mess
of error noise.  Under Linux I didn't see the problem, but I suspect
that is because it has a quirk entry or does not yet use
tagged queueing (with ahc).

At first I didn't understand the problem for what I now believe
it is.  As said above, on i386 -current the disk could not be talked
to.  It would probe but I could not touch it once the system was
up and running.  This was using an ahc (7880) controller.

On an alpha with a Qlogic 1040a controller I do not see the problem.
I'm thinking that is because the driver either is adaptive and backs
down on the max queue limit, or can back out of tagged queueing
completely.

On a Sparc box, I see the i386 behavior, but by accident I found
that after a "scsictl scsibus0 scan any any" this drive rejects
tagged queueing, having previously accepted it at boot.
I can then use the drive without a problem.

On a MacPPC box I do not see the problem, but that is likely because
the driver doesn't do tagged queueing.

After wasting hours installing Solaris 2.5.1 on one of these disks
on an old slow Sparc box and then not being able to boot afterward,
I read this in the Solaris Manager FAQ:

  Tagged Command Queueing (TCQ) is an optional part of the SCSI-2
  specification. It permits a drive to accept multiple I/O requests
  for execution later. These requests are "tagged" by a reusable
  id so that the drive and the OS can keep track of them. The drive
  can reorder these requests to optimize seeks. For more details,
  see the SCSI-2 specifications. A draft version is available at
  ftp://ftp.cs.toronto.edu/pub/jdd/scsi-doc/scsi2.10b.gz

  SunOS 4.x and earlier never uses tagged queueing. However, Solaris
  2.x will make use of tagged queuing if the drive claims to support
  it. Unfortunately, some drive manufacturers have found it hard
  to design their drives to do tagged queueing properly, and this
  particular area has been a common source of bugs in drive firmware.

  If it is not possible to turn off tagged queueing in the drive
  that is causing the problem, Solaris 2.x can be told not to use
  tagged queueing at all, by putting the following line in
  /etc/system:

	set scsi_options & ~0x80 

  The "scsi_options" kernel variable contains a number of bit flags
  which are defined in /usr/include/sys/scsi/conf/autoconf.h. 0x80
  corresponds to tagged queueing.

  However, this turns off tagged queueing for the entire machine,
  not just the problematic drive. Because tagged queueing can provide
  a significant performance enhancement for busy drives, this may
  not always be desirable. In Solaris 2.4 and later, it is possible
  to disable tagged queueing and set or clear other scsi options
  on a per-controller or per-drive basis. The appropriate technique
  is described in the esp(7) and isp(7) man pages.
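
To make the "set scsi_options & ~0x80" line in the quoted FAQ concrete,
here is a minimal C sketch of the same bit operation.  The 0x80 value
comes from the FAQ text; the macro and function names are made up for
this example and are not taken from the Solaris headers.

```c
/* 0x80 is the tagged queueing bit according to the FAQ quoted above.
 * The macro name is hypothetical, chosen only for this illustration. */
#define OPT_TAGGED_QUEUEING 0x80u

/* Return the option word with the tagged queueing bit cleared and
 * every other bit left untouched (the effect of "& ~0x80"). */
unsigned int
clear_tagged_queueing(unsigned int scsi_options)
{
	return scsi_options & ~OPT_TAGGED_QUEUEING;
}
```

Because the mask only clears one bit, applying it to a value that
already has tagged queueing off changes nothing.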

I had not realized there may be a generation of SCSI drives out
there with suspect tagged queueing "support."  I know the Seagate
manual for this drive claims it can handle up to 64 queue events.

>How-To-Repeat:

Find an ST15150N drive and attempt to use it with -current.

>Fix:

I added the drive to the quirk table with PQUIRK_NOTAG and have
had no problems since.

--- /usr/src/sys/dev/scsipi/scsiconf.c  Wed Jul 18 16:19:24 2001
+++ /tmp/scsiconf.c     Sun Aug 12 15:30:25 2001
@@ -544,6 +544,8 @@
        {{T_DIRECT, T_FIXED,
         "SEAGATE ", "ST125N          ", ""},     PQUIRK_NOLUNS},
        {{T_DIRECT, T_FIXED,
+        "SEAGATE ", "ST15150N        ", ""},     PQUIRK_NOTAG},
+       {{T_DIRECT, T_FIXED,
         "SEAGATE ", "ST157N          ", ""},     PQUIRK_NOLUNS},
        {{T_DIRECT, T_FIXED,
         "SEAGATE ", "ST296           ", ""},     PQUIRK_NOLUNS},
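
For readers unfamiliar with quirk tables, the following is a rough,
self-contained C sketch of the idea behind the patch above: the
vendor/product/revision strings from the INQUIRY data are compared
against a table, and a matching entry suppresses tagged queueing.
This is a simplification and not the actual scsipi matching code; the
struct, the function names and the flag values are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for illustration only; the real
 * PQUIRK_* definitions live in the scsipi headers. */
#define QUIRK_NOLUNS 0x01
#define QUIRK_NOTAG  0x02

struct quirk_entry {
	const char *vendor;	/* space padded, as in INQUIRY data */
	const char *product;	/* space padded, as in INQUIRY data */
	const char *revision;	/* "" matches any revision */
	int quirks;
};

static const struct quirk_entry quirk_table[] = {
	{ "SEAGATE ", "ST125N          ", "", QUIRK_NOLUNS },
	{ "SEAGATE ", "ST15150N        ", "", QUIRK_NOTAG },
	{ "SEAGATE ", "ST157N          ", "", QUIRK_NOLUNS },
	{ "SEAGATE ", "ST296           ", "", QUIRK_NOLUNS },
};

/* A table string matches if the device string starts with it,
 * so an empty table string matches anything. */
static int
prefix_match(const char *pattern, const char *value)
{
	return strncmp(pattern, value, strlen(pattern)) == 0;
}

/* Return the quirk flags of the first matching entry, or 0. */
int
lookup_quirks(const char *vendor, const char *product, const char *revision)
{
	size_t i;

	for (i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
		const struct quirk_entry *q = &quirk_table[i];

		if (prefix_match(q->vendor, vendor) &&
		    prefix_match(q->product, product) &&
		    prefix_match(q->revision, revision))
			return q->quirks;
	}
	return 0;
}

/* Use tagged queueing only if the drive claims support and no
 * NOTAG quirk is set for it. */
int
use_tagged_queueing(int device_claims_tags, int quirks)
{
	return device_claims_tags && !(quirks & QUIRK_NOTAG);
}
```

With such a table, an ST15150N that claims tagged queueing support
would still be driven untagged, while drives without an entry are
unaffected.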

>Release-Note:
>Audit-Trail:
>Unformatted: