Subject: port-pmax/5510: asc scsi driver broken in 1.3.1
To: None <gnats-bugs@gnats.netbsd.org>
From: None <greg@stevie.egr.unlv.edu>
List: netbsd-bugs
Date: 05/29/1998 15:54:40
>Number:         5510
>Category:       port-pmax
>Synopsis:       asc scsi driver broken in 1.3.1
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    gnats-admin (GNATS administrator)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri May 29 16:05:00 1998
>Last-Modified:
>Originator:     Greg Wohletz
>Organization:
UNLV
>Release:        NetBSD-1.3.1
>Environment:
DECstation 5000/200, NetBSD-1.3.1
System: NetBSD stevie.egr.unlv.edu 1.3.1 NetBSD 1.3.1 (WWW) #1: Thu Apr 23 13:32:53 PDT 1998 greg@www2.egr.unlv.edu:/amd/bb.cs.unlv.edu./sd0/src1/OS/NetBSD/NetBSD-1.3.1/src/sys/arch/pmax/compile/WWW pmax


>Description:
	There is a problem with the asc driver in 1.3.1 that appeared when
	it was changed to do DMA.  Below is a summary of various email
	messages I have exchanged with Michael Hitch regarding the problem.
	It is quite long, but I figured it was best to provide the complete
	history for the database.

Recently upgraded a DECstation 5000/200 from 1.2G to 1.3.1, and am now getting alot
of these:

Apr  9 08:03:44 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
Apr  9 08:03:44 lazy /netbsd: rz0: Illegal request, blk 3452431
Apr  9 08:03:44 lazy /netbsd: ccd0: error 5 on component 0
Apr  9 08:03:52 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
Apr  9 08:03:52 lazy /netbsd: rz0: Illegal request, blk 3336336
Apr  9 08:03:52 lazy /netbsd: ccd0: error 5 on component 0
Apr  9 08:14:50 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
Apr  9 08:14:50 lazy /netbsd: rz0: Illegal request, blk 2559904
Apr  9 08:14:50 lazy /netbsd: ccd0: error 5 on component 0
Apr  9 08:32:10 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
Apr  9 08:32:11 lazy /netbsd: rz0: Illegal request, blk 3082688
Apr  9 08:32:11 lazy /netbsd: ccd0: error 5 on component 0
Apr  9 08:38:08 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
Apr  9 08:38:08 lazy /netbsd: rz0: Illegal request, blk 2046225
Apr  9 08:38:08 lazy /netbsd: ccd0: error 5 on component 0

If I compile 1.3.1 replacing asc.c, asc_tc.c and ascvar.h with the 1.2G versions
the system works just fine (it is our news server so it does alot of disk IO)


Michael Hitch recomended compiling a kernel with DEBUG turned on, but kernels
compiled with DEBUG cause the system to crash with a ``asc_timeout''. (Michael
has verified this).

The version of the files that run well are:

/*      $NetBSD: asc.c,v 1.39 1997/06/16 03:46:29 jonathan Exp $        */
/*      $NetBSD: asc_tc.c,v 1.5 1997/06/22 07:44:05 jonathan Exp $      */
/*      $NetBSD: ascvar.h,v 1.1 1996/09/25 21:07:56 jonathan Exp $      */


Later Michael sent:

Hi, Greg.

While working on the RD42 problem, I found something that looks like it's
related to a collsion between a select request and bus reselection.  I'm
not certain it's related to the problem you are seeing, but here's a set
of diffs I would like you to try out if you get a chance.  This should be
run with "options ASC_DIAGNOSTIC" (or #define ASC_DIAGNOSTIC in asc.c) to
get a little more information.

  This diff also includes a specific check for an illegal command that
seems to have disappeared (if it was ever there in the first place).  It
will reboot the system immediately (after an error message) instead of
waiting until a timeout.  [This won't help the problem with DEBUG defined,
but at least it won't be a 'timeout' anymore.]

Thanks,
Michael

--- asc.c.save  Sun Apr 26 12:03:56 1998
+++ asc.c       Sun Apr 26 12:07:26 1998
@@ -1060,13 +1090,15 @@
                        } else
                                state->script = asc->script;
                } else if (state->flags & DMA_IN) {
+#ifdef ASC_DIAGNOSTIC
                        if (len) {
-#ifdef DEBUG
-                               printf("asc_intr: 1: bn %d len %d (fifo %d)\n",
-                                       asc_debug_bn, len, fifo); /* XXX */
-#endif
-                               goto abort;
+                               printf("asc_intr: 1: len %d (fifo %d)\n",
+                                       len, fifo); /* XXX */
+                               printf("  flags %x dmalen %d status %x ir %x\n"
,
+                                   state->flags, state->dmalen, status, ir);
+/*                             goto abort;*/
                        }
+#endif
                        /* setup state to resume to */
                        if (state->flags & DMA_IN_PROGRESS) {
                                len = state->dmalen;
@@ -1208,12 +1240,14 @@
                /* Should process reselect? */
        }
 
+#if 1
        /* mhitch - check for illegal command */
        if (ir & ASC_INT_ILL) {
                printf("asc_intr: Illegal command status %x ir %x cmd %x\n",
                    status, ir, regs->asc_cmd);
                goto abort;     /* XXX */
        }
+#endif
 
        /* check for reselect */
        if (ir & ASC_INT_RESEL) {
@@ -1226,6 +1260,13 @@
 
                        goto abort;
                }
+#ifdef ASC_DIAGNOSTIC
+               if (fifo > 2) {
+                       printf("asc_intr: target %d, reselect, fifo %d\n",
+                           asc->target, fifo);
+                       /* display extra fifo data */
+               }
+#endif
                /* read unencoded SCSI ID and convert to binary */
                msg = regs->asc_fifo & asc->myidmask;
                for (id = 0; (msg & 1) == 0; id++)
@@ -1238,6 +1279,7 @@
                else
                        asc_logp[-1].msg = msg;
 #endif
+ASC_TC_PUT(regs,0);    /* ensure TC clear? */
                asc->state = ASC_STATE_BUSY;
                asc->target = id;
                state = &asc->st[id];


The patch ended up having a bug in it and he sent this addendum patch:

--- asc.c.orig  Sun Apr 26 19:22:33 1998
+++ asc.c       Sun Apr 26 22:37:34 1998
@@ -906,7 +913,8 @@
                state = &asc->st[asc->target];
                switch (ASC_PHASE(status)) {
                case SCSI_PHASE_DATAI:
-                       if ((asc->script - asc_scripts) == SCRIPT_GET_STATUS) {
+                       if ((asc->script - asc_scripts) == SCRIPT_DATA_IN + 1 |
|
+                           (asc->script - asc_scripts) == SCRIPT_CONTINUE_IN) 
{
                                /*
                                 * From the Mach driver:
                                 * After a reconnect and restart dma in, we

However this patch didn't end up fixing the problem:

On Apr 27, 11:07am, Greg Wohletz wrote:
> OK, took it about 10 minutes more to get the illegal command, here is the log
:
> 
> asc_intr: Illegal command status 87 ir 40 cmd 11

  This is where I'm currently stuck - no fix at the present.

  All my Illegal commands had "cmd 10", which is a "Transfer Information"
command.  The 11 is and "Initiator Command Complete Sequence", used by the
53c94 to get the status and message byte.

  One thing I was thinking of trying was to disable disconnects and see
if that had any affect.  I think that can be accomplished by changing
SCSI_DIS_REC_IDENTIFY to SCSI_IDENTIFY.  That would eliminate any problems
the driver might have with collisions between selecting devices.

-- 
Michael L. Hitch                        mhitch@montana.edu


Then this:

>How-To-Repeat:
	Do heavy disk IO on a 5000/200
>Fix:
	None yet.
>Audit-Trail:
>Unformatted:
>>  One thing I was thinking of trying was to disable disconnects and see
>>if that had any affect.  I think that can be accomplished by changing
>>SCSI_DIS_REC_IDENTIFY to SCSI_IDENTIFY.  That would eliminate any problems
>>the driver might have with collisions between selecting devices.
>
>After running for about 2 hours it did the thing where it goes into an
>infinite loop printing:
>
>SENSE
>SENSE
>SENSE
>SENSE
>SENSE
>SENSE
>SENSE
>SENSE
>
>
>unfortunately this scrolled any diagnostic output that might have been
>printed off the screen and out of the dmesg buffer long before I could hit
>the reset button.  Have you seen that?
>
>                                       --Greg

Looks like the following from rz.c is where it gets into the tight loop:

        while (sc->sc_flags & RZF_SENSEINPROGRESS)
                printf("SENSE\n"); /* XXX */




Then a while later:

OK, I've worked on the "illegal command" problem with DEBUG defined, and
think I have a work-around.  I still haven't figured out why it is
occuring, but I have a fix that seems to get around the problem [at least
on my system].

Here's the complete set of diffs I'm currently running with.  It currently
prints out the illegal command message if ASC_DIAGNOSTIC is set, so it
can get a little verbose.  When hitting my RD42 hard as well as an RZ24,
I get quite a few of those messages, but everything seems to run OK otherwise.

If you could give these diffs a try within a couple of days, I will try to
get them into the 1.3.2 patches.

Thanks,
Michael

--- /sys/dev/tc/asc.c.orig      Sun Apr 26 19:22:33 1998
+++ /sys/dev/tc/asc.c   Tue May  5 11:25:18 1998
@@ -830,6 +830,7 @@
        register script_t *scpt;
        register int ss, ir, status;
        register unsigned char cmd_was;
+/*mhitch*/ static int ill_cmd_count = 0;
 
        /* collect ephemeral information */
        status = regs->asc_status;
@@ -906,7 +907,8 @@
                state = &asc->st[asc->target];
                switch (ASC_PHASE(status)) {
                case SCSI_PHASE_DATAI:
-                       if ((asc->script - asc_scripts) == SCRIPT_GET_STATUS) {
+                       if ((asc->script - asc_scripts) == SCRIPT_DATA_IN + 1 |
|
+                           (asc->script - asc_scripts) == SCRIPT_CONTINUE_IN) 
{
                                /*
                                 * From the Mach driver:
                                 * After a reconnect and restart dma in, we
@@ -1054,11 +1056,8 @@
                                state->script = asc->script;
                } else if (state->flags & DMA_IN) {
                        if (len) {
-#ifdef DEBUG
-                               printf("asc_intr: 1: bn %d len %d (fifo %d)\n",
-                                       asc_debug_bn, len, fifo); /* XXX */
-#endif
-                               goto abort;
+                               printf("asc_intr: 1: len %d (fifo %d)\n",
+                                       len, fifo); /* XXX */
                        }
                        /* setup state to resume to */
                        if (state->flags & DMA_IN_PROGRESS) {
@@ -1201,6 +1200,21 @@
                /* Should process reselect? */
        }
 
+       /* mhitch - check for illegal command */
+       if (ir & ASC_INT_ILL) {
+#ifdef ASC_DIAGNOSTIC
+               printf("asc_intr: Illegal command status %x ir %x cmd %x ? %x\n
",
+                   status, ir, regs->asc_cmd, asc_scripts[SCRIPT_MSG_IN].comma
nd);
+#endif
+/*mhitch*/     if (ill_cmd_count++ != 3) {
+/*mhitch*/             regs->asc_cmd = ASC_CMD_MSG_ACPT;
+/*mhitch*/             readback(regs->asc_cmd);
+/*mhitch*/             goto done;
+/*mhitch*/     }
+               goto abort;     /* XXX */
+       }
+/*mhitch*/ ill_cmd_count = 0;
+
        /* check for reselect */
        if (ir & ASC_INT_RESEL) {
                unsigned fifo, id, msg;
@@ -1224,6 +1238,11 @@
                else
                        asc_logp[-1].msg = msg;
 #endif
+               /*
+                * TC may have been initialized during a selection attempt.
+                * Clear it to prevent possible confusion later.
+                */
+               ASC_TC_PUT(regs,0);     /* ensure TC clear */
                asc->state = ASC_STATE_BUSY;
                asc->target = id;
                state = &asc->st[id];



I tried this code and:

OK, the thing kept getting those Illegal request errors till eventually the
filesystem appeared to become corrupt (presumably due to a write failure or
some such, here is the log:

May  6 11:37:39 lazy /netbsd: asc_intr: Illegal command status 87 ir 40 cmd 11 
? 12
May  6 11:37:43 lazy /netbsd: asc_intr: Illegal command status 87 ir 40 cmd 11 
? 12
May  6 11:38:57 lazy /netbsd: asc_intr: ignoring strange interrupt tc 3018 fifo
 residue 7
May  6 11:39:53 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
May  6 11:39:53 lazy /netbsd: rz1: Illegal request, blk 2681550
May  6 11:39:54 lazy /netbsd: ccd0: error 5 on component 1
May  6 11:42:25 lazy /netbsd: asc_intr: ignoring strange interrupt tc 8142 fifo
 residue 4
May  6 11:42:32 lazy /netbsd: asc_intr: ignoring strange interrupt tc 8142 fifo
 residue 4
May  6 11:44:38 lazy /netbsd: asc_intr: ignoring strange interrupt tc 8142 fifo
 residue 4
May  6 11:46:54 lazy /netbsd: asc_intr: Illegal command status 87 ir 40 cmd 11 
? 12
May  6 11:49:25 lazy /netbsd: asc_intr: ignoring strange interrupt tc 974 fifo 
residue 3
May  6 11:49:39 lazy /netbsd: asc_intr: ignoring strange interrupt tc 8136 fifo
 residue 4
May  6 11:50:05 lazy /netbsd: asc_intr: Illegal command status 87 ir 40 cmd 11 
? 12
May  6 11:50:44 lazy /netbsd: asc_intr: ignoring strange interrupt tc 974 fifo 
residue 3
May  6 11:51:08 lazy /netbsd: asc_intr: Illegal command status 87 ir 40 cmd 11 
? 12
May  6 11:52:52 lazy /netbsd: asc_intr: ignoring strange interrupt tc 8140 fifo
 residue 3
May  6 11:54:00 lazy /netbsd: asc_intr: Illegal command status 87 ir 40 cmd 11 
? 12
May  6 12:00:22 lazy /netbsd: asc_intr: ignoring strange interrupt tc 8142 fifo
 residue 4
May  6 12:01:33 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
May  6 12:01:35 lazy /netbsd: SENrz0: Illegal request, blk 2966426
May  6 12:01:35 lazy /netbsd: ccd0: error 5 on component 0
May  6 12:01:35 lazy /netbsd: SE
May  6 12:01:36 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
May  6 12:01:37 lazy /netbsd: rz0: Illegal request, blk 847456
May  6 12:01:37 lazy /netbsd: ccd0: error 5 on component 0
May  6 12:02:56 lazy /netbsd: asc_get_status: cmdreg 11, fifo cnt 7
May  6 12:02:57 lazy /netbsd: rz0: Illegal request, blk 1261711
May  6 12:02:58 lazy /netbsd: ccd0: error 5 on component 0
May  6 12:08:15 lazy /netbsd: mode = 0100664, inum = 1213027, fs = /var/news
May  6 12:08:15 lazy /netbsd: panic: ffs_valloc: dup alloc

either of us.