Subject: Re: 5000/200 filesystem hang
To: None <port-pmax@NetBSD.ORG>
From: Michael L. Hitch <osymh@lightning.oscs.montana.edu>
List: port-pmax
Date: 06/07/1996 09:03:13
On Jun  7,  1:51pm, Manuel Bouyer wrote:
} I have some problem with the -current kernel (with the new asc driver).
} The I/O operations hang. The processes still runs fine (under X-window, I
...
} This seems to be related to intense disk activity: a 'make -j4' kernel
} compile or a compile+tar backup via nfs.
} 
} Did anyone else noticed this ? Here is the dmesg output:

  I have seen one hang on the 5000/25 I've been using to test the asc driver.
I was running a systat -vmstat on one session, and when my other processes
stopped, the systat showed several processes in the 'd' state (disk wait?).
My guess is that the driver has gotten wedged someplace.  I haven't figured
out any easy way to troubleshoot this problem yet.  Since I've only seen
it once, it's pretty hard for me to duplicate.

} Beginning old-style SCSI device autoconfiguration
} rz2 at asc0 drive 2 slave 0 DEC RZ55     (C) DEC rev 1000, 649040 512 byte blocks
} rz3 at asc0 drive 3 slave 0 MAXTOR P0-12S rev HB18, 1999038 512 byte blocks

  I'd guess that the Maxtor drive is probably "newer" and faster than the
older DEC drives (I've been using RZ23's, RZ24, and RZ25).  Jonathan has
indicated to me that the faster SCSI-2 drives cause more problems with the
asc driver.  I doubt I will be able to do much with it unless I can get hold
of some faster disks.

} Also, this kind of message is always followed by a brutal reboot (without
} panic).

  The "abort" process in the driver just does a boot(4) rather than a panic.
A panic probably wouldn't work anyway:  it would try to sync the disk drives,
and with the disk driver in an unusable state at that point, it would either
hang or loop.

} asc_intr: data overrun: buflen 2048 dmalen 2048 tc 1994 fifo 4

  This would seem to indicate that the driver got an interrupt in a state
it wasn't expecting.  It appears to have transferred only 54 bytes of data
(dmalen 2048 less the count still remaining in tc, 1994).
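
  The 54-byte figure can be checked from the counters in the message.  This
is only a sketch, assuming tc is the count still left in the chip's transfer
counter when the interrupt arrived; the function name is hypothetical and is
not taken from the asc driver.

```c
/*
 * Hypothetical helper, not part of the asc driver.
 * Assumption: tc is the number of bytes still left to transfer,
 * so the bytes already moved are dmalen minus tc.
 */
int
bytes_transferred(int dmalen, int tc)
{
	return dmalen - tc;
}
```

  With the values from the quoted line, bytes_transferred(2048, 1994) gives
54, which is far short of the 2048 bytes requested.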

  The rest of the output is a dump of the last several states of the driver.
It's supposed to give some clue as to what happened prior to the current
condition.

} asc: asc_intr: cmd 28 bn 637084 cnt 4
                     ^          ^     ^
          read command   block no.    block count
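
  A line like this can be picked apart mechanically.  The following is a
rough sketch rather than anything from the driver:  the struct and function
names are hypothetical, and it assumes that cmd is printed in hex (0x28 is
the SCSI READ(10) opcode) while bn and cnt are printed in decimal.

```c
#include <stdio.h>

#define SCSI_READ_10	0x28	/* READ(10) opcode */
#define BLOCK_SIZE	512	/* from the "512 byte blocks" dmesg lines */

/* Hypothetical record for one "asc_intr: cmd ..." log line. */
struct asc_cmd_log {
	unsigned cmd;	/* SCSI opcode (assumed hex) */
	long bn;	/* starting block number (assumed decimal) */
	long cnt;	/* block count (assumed decimal) */
};

/* Returns 1 if all three fields were parsed, 0 otherwise. */
int
parse_cmd_line(const char *line, struct asc_cmd_log *out)
{
	return sscanf(line, "asc: asc_intr: cmd %x bn %ld cnt %ld",
	    &out->cmd, &out->bn, &out->cnt) == 3;
}

/* Size in bytes of the transfer the command asks for. */
long
expected_bytes(const struct asc_cmd_log *c)
{
	return c->cnt * BLOCK_SIZE;
}
```

  Under those assumptions a count of 4 blocks works out to 2048 bytes, which
matches the buflen and dmalen values in the data overrun message.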
...
} asc0 tgt 3 status 0 ss 0 ir 28 cond 0:118 msg c0 resid 2048
                              ^^
  This is where the read command starts.

} asc0 tgt 3 status 97 ss 8c ir 18 cond 0:118 msg c2 resid 0
} asc0 tgt 3 status 97 ss 8c ir 8 cond 9:708 msg 2 resid 0
} asc0 tgt 3 status 97 ss 8c ir 10 cond 10:710 msg 12 resid 0
} asc0 tgt 3 status 97 ss 8c ir 8 cond 9:708 msg 4 resid 0
} asc0 tgt 3 status 90 ss cc ir 20 cond 15:20 msg 12 resid 0
} asc0 tgt -1 status 97 ss 9c ir c cond 16:700 msg 80 resid 0
} asc0 tgt 3 status 91 ss 44 ir 10 cond 17:110 msg 12 resid 2048
} asc0 tgt 3 status 91 ss 4c ir 10 cond 1:310 msg 90 resid 0

  This represents the current interrupt state when the "abort" was done.

} Any idea ?

  I'll have to check over the last part of the output and try to figure
out what the driver was doing.

  I'm not sure if I will be able to do much about fixing problems like
these without having the hardware to duplicate the problems.

Michael

-- 
Michael L. Hitch			INTERNET:  osymh@montana.edu
Computer Consultant
Information Technology Center
Montana State University	Bozeman, MT	USA