Subject: kern/10583: adw(4) times out and hangs occasionally in -current, vnode cache frailty
To: None <email@example.com>
From: None <firstname.lastname@example.org>
Date: 07/17/2000 21:50:49
>Synopsis: adw(4) times out and hangs occasionally in -current, vnode cache frailty
>Arrival-Date: Fri Jul 14 06:45:00 PDT 2000
>Originator: Sean Doran
>Release: cvs as of 2000 06 11
System: NetBSD crasse.smd.ebone.net 1.5B NetBSD 1.5B (SCREAM) #0: Tue Jul 11 20:27:03 CEST 2000 email@example.com:/usr/src/sys/arch/i386/compile/SCREAM i386
There are two problems here: the adw(4) timeout itself, and the side effects
of the adw(4) recovery from the timeout, which point to a frailty in the
vnode cache.
Nearly every morning, once I am awake, I return to this machine and
start doing some work; within 30 minutes at most, though not always
on the very first access to sd3 or sd4, the kernel will complain:
sd3(adw0:1:0): timed out
sd3(adw0:1:0): timed out
sd3(adw0:1:0): timed out AGAIN. Resetting SCSI Bus
after which there is one of four fates:
a/ things behave normally
b/ the machine hangs solid, and never recovers
c/ the machine hangs solid from an outside perspective, but eventually
panics with a softdep lock held problem (rare)
d/ the system is back to NEARLY normal operation, but with an
apparent vnode corruption [should I open a PR on that specifically?]
(d) is best illustrated by this:
[invoked during or triggering the adw timeout]
: crasse (es) ; ICQ&
: crasse (es) ; /u/smd/bin/ICQ: /usr/pkg/java/bin/java: not found
crasse# ls /usr/pkg/java
ls: /usr/pkg/java: Bad file descriptor
fsck -f on the raw partition is happy
fsdb -f on the raw partition says:
fsdb (inum: 2)> ls
slot 16 ino 97946 reclen 16: directory, `java'
fsdb (inum: 2)> cd java
component `java': current inode: directory
I=97946 MODE=40755 SIZE=512
MTIME=Feb 3 08:25:22 2000 [0 nsec]
CTIME=Mar 11 19:05:19 2000 [507351000 nsec]
ATIME=Jul 14 05:22:05 2000 [988834000 nsec]
OWNER=root GRP=wheel LINKCNT=6 FLAGS=0x0 BLKCNT=0x4 GEN=0x1
fsdb (inum: 97946)> ls
slot 0 ino 97946 reclen 12: directory, `.'
slot 1 ino 2 reclen 12: directory, `..'
slot 2 ino 103706 reclen 16: directory, `include'
slot 3 ino 126746 reclen 12: directory, `bin'
slot 4 ino 144026 reclen 12: directory, `lib'
slot 5 ino 207386 reclen 16: directory, `demo'
slot 6 ino 98006 reclen 20: regular, `COPYRIGHT'
slot 7 ino 98007 reclen 16: regular, `src.zip'
slot 8 ino 98008 reclen 16: regular, `LICENSE'
slot 9 ino 98009 reclen 16: regular, `README'
slot 10 ino 98010 reclen 24: regular, `README.NetBSD'
slot 11 ino 98011 reclen 16: regular, `CHANGES'
slot 12 ino 98012 reclen 324: regular, `index.html'
An unmount/mount or a mount -u -o reload at this point will fix things up,
but this is kinda gross and cancerous.
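
For anyone trying to spot fate (d) from a script before remounting, the
distinguishing symptom in the transcript above is EBADF ("Bad file
descriptor") on a directory that fsck and fsdb both consider healthy.
The following is only a minimal sketch in Python, not something from the
original report: dir_state is a hypothetical helper name, and it assumes
the error reaches userland as EBADF, as it did for ls.

```python
import errno
import os


def dir_state(path):
    """Classify a directory lookup as 'ok', 'ebadf', or 'other:<ERRNO>'.

    'ebadf' is the symptom reported for /usr/pkg/java in fate (d);
    anything else (e.g. ENOENT for a path that really is missing)
    is reported separately so the two cases are not confused.
    """
    try:
        os.listdir(path)
    except OSError as e:
        if e.errno == errno.EBADF:
            return "ebadf"
        # Fall back to the numeric value if the errno has no symbolic name.
        return "other:" + errno.errorcode.get(e.errno, str(e.errno))
    return "ok"


if __name__ == "__main__":
    # The path is taken from the transcript above; adjust as needed.
    print(dir_state("/usr/pkg/java"))
```

On the affected machine one would expect this to print "ebadf" while the
problem is present and "ok" after the remount, but that has not been
verified here.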
Note that I have seen something similar before, with an
isp(4) hang problem that mjacob fixed:
> Cool. I have one other bizarre symptom; I discovered which drive was OTL
> by firing up access on all the drives on the scsi chain, and of course the
> ones to sd2 would hang. However, one of the accesses was /bin/ls /usr
> (/usr is on sd2) and got back a command prompt and a successful exit value.
run for a while
go to sleep
run for a (short) while, eventually touching a disk on the adw(4)
In the pre-PR discussions with dante and thorpej, Jason sent me a
patch to deal with the Intel 82443BX (pchb) IPDLT (Idle/Pipeline
DRAM Leadoff Timing) flag being set wrong by some BIOSes; this had
no effect on the timeouts AFAICT.
No fix known yet.