Subject: kern/10583: adw(4) times out and hangs occasionally in -current, vnode cache frailty
To: None <>
From: None <>
List: netbsd-bugs
Date: 07/17/2000 21:50:49
>Number:         10583
>Category:       kern
>Synopsis:       adw(4) times out and hangs occasionally in -current, vnode cache frailty
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Jul 14 06:45:00 PDT 2000
>Originator:     Sean Doran
>Release:        cvs as of 2000 06 11
System: NetBSD 1.5B NetBSD 1.5B (SCREAM) #0: Tue Jul 11 20:27:03 CEST 2000 i386


There are two problems here: the adw(4) timeout itself, and the
side effects of the adw(4) recovery from the timeout, which point to a
frailty in the vnode cache(?).

Nearly every morning, once I am awake, I return to this machine and
start doing some work; within 30 minutes at most, though not always
on the very first access to sd3 or sd4, the kernel will complain:

sd3(adw0:1:0): timed out
sd3(adw0:1:0): timed out
sd3(adw0:1:0): timed out AGAIN. Resetting SCSI Bus

after which one of four fates follows:

	a/ things behave normally
	b/ the machine hangs solid, and never recovers
	c/ the machine hangs solid from an outside perspective, but eventually
	   panics with a softdep lock held problem (rare)
	d/ the system is back to NEARLY normal operation, but with an
	   apparent vnode corruption [should I open a PR on that specifically?]

(d) is best illustrated by this:

[invoked during or triggering the adw timeout]
: crasse (es) ; ICQ&
: crasse (es) ; /u/smd/bin/ICQ: /usr/pkg/java/bin/java: not found
crasse# ls /usr/pkg/java
ls: /usr/pkg/java: Bad file descriptor

fsck -f on the raw partition is happy

fsdb -f on the raw partition says:
fsdb (inum: 2)> ls
slot 16 ino 97946 reclen 16: directory, `java'
fsdb (inum: 2)> cd java
component `java': current inode: directory
I=97946 MODE=40755 SIZE=512
        MTIME=Feb  3 08:25:22 2000 [0 nsec]
        CTIME=Mar 11 19:05:19 2000 [507351000 nsec]
        ATIME=Jul 14 05:22:05 2000 [988834000 nsec]
OWNER=root GRP=wheel LINKCNT=6 FLAGS=0x0 BLKCNT=0x4 GEN=0x1
fsdb (inum: 97946)> ls
slot 0 ino 97946 reclen 12: directory, `.'
slot 1 ino 2 reclen 12: directory, `..'
slot 2 ino 103706 reclen 16: directory, `include'
slot 3 ino 126746 reclen 12: directory, `bin'
slot 4 ino 144026 reclen 12: directory, `lib'
slot 5 ino 207386 reclen 16: directory, `demo'
slot 6 ino 98006 reclen 20: regular, `COPYRIGHT'
slot 7 ino 98007 reclen 16: regular, `'
slot 8 ino 98008 reclen 16: regular, `LICENSE'
slot 9 ino 98009 reclen 16: regular, `README'
slot 10 ino 98010 reclen 24: regular, `README.NetBSD'
slot 11 ino 98011 reclen 16: regular, `CHANGES'
slot 12 ino 98012 reclen 324: regular, `index.html'


An unmount/mount or a mount -u -o reload at this point will fix things up,
but this is kinda gross and cancerous.
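
A quick way to check whether a directory is in state (d) is to try
listing it and look at the errno that comes back. The following is only
a rough sketch, not something I ran on the affected machine; the helper
name probe_dir is made up for illustration, and /usr/pkg/java is simply
the path from the transcript above.

```python
import errno
import os


def probe_dir(path):
    """Try to list a directory; return "ok" or the symbolic errno name.

    probe_dir is a hypothetical helper for illustration only.
    """
    try:
        os.listdir(path)
    except OSError as e:
        # Map the numeric errno to its name, e.g. EBADF or ENOENT.
        return errno.errorcode.get(e.errno, str(e.errno))
    return "ok"


if __name__ == "__main__":
    # Path taken from the transcript; adjust for the filesystem under test.
    print(probe_dir("/usr/pkg/java"))
```

If this prints EBADF while fsck -f on the raw partition is clean, the
on-disk state is presumably fine and only the in-core view is stale,
which is the case where the remount or reload above helps.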

Note that I have seen something similar before, with an
isp(4) hang problem that mjacob fixed:

 > Cool.  I have one other bizarre symptom; I discovered which drive was OTL
 > by firing up access on all the drives on the scsi chain, and of course the
 > ones to sd2 would hang.   However, one of the accesses was /bin/ls /usr
 > (/usr is on sd2) and got back a command prompt and a successful exit value.

	boot -current
	run for a while
	go to sleep
	wake up
	run for a (short) while, eventually touching a disk on the adw(4)
	observe timeout
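
The "touch a disk and observe" step above can be approximated by timing
one small read after the idle period. This is a minimal sketch under
assumptions: time_first_read is a hypothetical helper, and the device
path in the example is a guess at a raw device node for sd3, not
something taken from this report. A read of an ordinary file may be
satisfied from cache and never reach the adw(4), so a raw device (which
needs root) is the more telling target.

```python
import os
import time


def time_first_read(path, nbytes=512):
    """Return (elapsed_seconds, bytes_read) for one small read from path.

    time_first_read is a hypothetical helper for illustration only.
    """
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, nbytes)
    finally:
        os.close(fd)
    return time.monotonic() - start, len(data)


if __name__ == "__main__":
    # Assumed raw device name for sd3 on i386; substitute the real one.
    elapsed, count = time_first_read("/dev/rsd3d")
    print("read %d bytes in %.3f s" % (count, elapsed))
```

An elapsed time of many seconds, together with the "timed out" messages
on the console, would match the behaviour described here. If the machine
lands in fate (b), the read will of course never return.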

In the pre-PR discussions with dante and thorpej, Jason sent me a
patch to deal with the Intel 82443BX (pchb) IPDLT (Idle/Pipeline
DRAM Leadoff Timing) flag being set wrong by some BIOSes; this had
no effect on the timeouts AFAICT.

No fix known yet.