Subject: kern/9811: adw(4) hang problem in -current?(i386)
To: None <gnats-bugs@gnats.netbsd.org>
From: None <smd@ebone.net>
List: netbsd-bugs
Date: 04/06/2000 09:32:10
>Number:         9811
>Category:       kern
>Synopsis:       disk accesses timeout and never recover
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Apr 06 02:55:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     Sean Doran
>Release:        current as of 1 Apr
>Organization:
	
>Environment:
	
System: NetBSD crasse.smd.ebone.net 1.4X NetBSD 1.4X (SCREAM) #0: Sat Apr 1 02:01:29 CEST 2000 smd@crasse.smd.ebone.net:/usr/src/sys/arch/i386/compile/SCREAM i386

adw0 at pci0 dev 19 function 0: AdvanSys ASB-3940U2W SCSI adapter
adw0: interrupting at irq 10
scsibus2 at adw0: 16 targets, 8 luns per target
...
scsibus2: waiting 2 seconds for devices to settle...
sd3 at scsibus2 target 1 lun 0: <IBM, DGHS18D, 03E0> SCSI3 0/direct fixed
sd3: 17501 MB, 8154 cyl, 20 head, 219 sec, 512 bytes/sect x 35843670 sectors
sd4 at scsibus2 target 6 lun 0: <QUANTUM, QM318000TD-SW, N491> SCSI2 0/direct fixed
sd4: 17366 MB, 8057 cyl, 20 head, 220 sec, 512 bytes/sect x 35566500 sectors

>Description:
	sd3(adw0:1:0): timed out
	sd3(adw0:1:0): timed out
	sd3(adw0:1:0): timed out
and that's all she wrote: all accesses to the timed-out disk (sd3, and
sometimes sd4) simply block, as the ps output below shows. That output was
taken after dropping into ddb and issuing "kill 0t1" and "c".

scsictl has no effect.

Cycling power on the disk (unplug the SCA<->LVD 68-pin converter, wait,
replug) either does nothing or triggers:

	sd3: respinning up disk
	sd3(adw0:1:0): timed out

Occasionally the disk hangs with its busy LED stuck on; usually the LED
is off.

The hangs occur most often after the machine has been up for some hours,
and so far never under particularly heavy load.

envstat shows nothing unusual thermally, and a hit of the reset
button or a quick power cycle of the entire machine will always
result in a perfectly happy controller/disk combination, for many hours.

I have done nothing unusual configuration-wise to the adaptor card,
and since the disks run normally for long periods of time under mixed
loads, I find it hard to think of how to blame hardware.

Unfortunately, this gives me maximal uptimes around 30 hours,
since I cannot recover from the timed out disk without a(n unclean) shutdown.

  UID   PID  PPID CPU PRI NI   VSZ  RSS WCHAN  STAT TT       TIME COMMAND
    0     0     0   0 -18  0     0 15516 schedu DLs  ??    0:00.03 (swapper)
    0     1     0   0  10  0   276  240 wait   Is   ??    0:00.01 init 
    0     2     0   0  10  0     0 15516 apmev  DL   ??    0:00.57 (apm0)
    0     3     0   0 -18  0     0 15516 daemon DL   ??    0:00.00 (pagedaemon)
    0     4     0   0 -18  0     0 15516 reaper DL   ??    0:00.11 (reaper)
    0     5     0   0  18  0     0 15516 syncer DL   ??    0:01.79 (ioflush)
    0   196     1   0  -2  0   256  608 vnlock Ds   ??    0:00.22 /usr/pkg/sbin
 3005  3267     1   0  -2  0  1760 2540 vnlock D    p0-   0:00.00 /usr/X11R6/bi
 3005   282     1   0  -2  0 15328 14164 vnlock D    p3-   1:47.11 /usr/pkg/lib/
 3005   283   282   0   2  0     0    0 -      Z    p3-   0:00.00 (netscape)
 3005   483     1  30  -5  0 16852 17156 scsipi D    p3- 333:19.41 /usr/pkg/lib/
 3005   484   483   0   2  0     0    0 -      Z    p3-   0:00.00 (netscape)
 3005   611     1   0  -2  0 12456 11620 vnlock D    p3-   0:17.51 /usr/pkg/lib/
 3005   612   611   0   2  0     0    0 -      Z    p3-   0:00.00 (netscape)
 3005  1909     1   0  -5  0 19568 20752 biowai D    p3-   0:19.69 /usr/pkg/lib/
 3005  1910  1909   0   2  0     0    0 -      Z    p3-   0:00.00 (netscape)
    0  3266     1   0  -2  0   364  240 vnlock D    p5-   0:00.01 -csh 
 3005   757     1   0  -5  0   528  296 biowai Ds+  p7    0:00.03 es 
 3005  1081   757  29  -2  4 22584 23044 vnlock DNE  p7  303:54.65 /usr/pkg/lib/
 3005  1082  1081   0  28  0     0    0 -      Z    p7    0:00.00 (netscape)
 3005  1567   757   3  31  0     0    0 -      Z    p7    0:00.00 (netscape)
    0   206     1   5  -2  0   632  276 vnlock D    E0-  68:34.98 ./rc5des  
    0  3268     1   0  -2  0    24  104 vnlock D    E0-   0:00.00 /usr/bin/su 
    0  3269     1   0  10  0   396  192 wait   Ss   E0    0:00.00 /bin/sh 
    0  3271  3269   0  28  0   312  188 -      R+   E0    0:00.00 ps -axl 
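
To make the pattern in the listing above easier to see, the blocked
processes can be grouped by wait channel. What follows is only a rough
Python sketch and not something that was run on the affected machine. The
function name blocked_by_wchan is hypothetical, the sample rows are a
subset copied from the listing above, and the parser assumes the 13-column
"ps -axl" layout shown there. Note that kernel threads such as the
pagedaemon also show a D state in normal operation, so they are left out
of the sample.

```python
from collections import Counter

# A few rows copied from the "ps -axl" listing above (subset, for illustration).
SAMPLE = """\
  UID   PID  PPID CPU PRI NI   VSZ  RSS WCHAN  STAT TT       TIME COMMAND
    0   196     1   0  -2  0   256  608 vnlock Ds   ??    0:00.22 /usr/pkg/sbin
 3005   483     1  30  -5  0 16852 17156 scsipi D    p3- 333:19.41 /usr/pkg/lib/
 3005  1909     1   0  -5  0 19568 20752 biowai D    p3-   0:19.69 /usr/pkg/lib/
 3005   283   282   0   2  0     0    0 -      Z    p3-   0:00.00 (netscape)
    0  3269     1   0  10  0   396  192 wait   Ss   E0    0:00.00 /bin/sh
"""


def blocked_by_wchan(ps_text):
    """Count processes in uninterruptible wait (STAT starting with 'D')
    per wait channel. Hypothetical helper; assumes 13 ps -axl columns."""
    counts = Counter()
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 12)      # COMMAND may contain spaces
        if len(fields) < 13:
            continue                       # ignore short or malformed rows
        wchan, stat = fields[8], fields[9]
        if stat.startswith("D"):
            counts[wchan] += 1
    return counts


if __name__ == "__main__":
    for wchan, n in sorted(blocked_by_wchan(SAMPLE).items()):
        print(wchan, n)
```

Zombie (Z) and sleeping (S) entries are skipped on purpose, so only the
processes stuck on vnlock, scsipi and biowai are counted.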


>How-To-Repeat:
	boot
	run normally 
	one of the disks hanging off the adw(4) controller times out
	try to access it
	see process hang
	
	particularly fun when the disk that times out has /usr/pkg
	and /usr/pkg/sbin is touched by root's csh startup rehash... -:(
	
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: