Subject: kern/10583: adw(4) times out and hangs occasionally in -current, vnode cache frailty
To: None <gnats-bugs@gnats.netbsd.org>
From: None <smd@ebone.net>
List: netbsd-bugs
Date: 07/17/2000 21:50:49
>Number: 10583
>Category: kern
>Synopsis: adw(4) times out and hangs occasionally in -current, vnode cache frailty
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jul 14 06:45:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator: Sean Doran
>Release: cvs as of 2000 06 11
>Organization:
>Environment:
System: NetBSD crasse.smd.ebone.net 1.5B NetBSD 1.5B (SCREAM) #0: Tue Jul 11 20:27:03 CEST 2000 smd@crasse.smd.ebone.net:/usr/src/sys/arch/i386/compile/SCREAM i386
>Description:
There are two problems here: an adw(4) timeout, and side effects of the
adw(4) recovery from that timeout which point to a frailty in the
vnode cache(?).
Nearly every morning, once I am awake, I return to this machine and
start doing some work; within 30 minutes, though not always
on the very first access to sd3 or sd4, the kernel will complain:
sd3(adw0:1:0): timed out
sd3(adw0:1:0): timed out
sd3(adw0:1:0): timed out AGAIN. Resetting SCSI Bus
after which one of four things happens:
a/ things behave normally
b/ the machine hangs solid, and never recovers
c/ the machine hangs solid from an outside perspective, but eventually
panics with a softdep lock held problem (rare)
d/ the system is back to NEARLY normal operation, but with an
apparent vnode corruption [should I open a PR on that specifically?]
(d) is best illustrated by this:
[invoked during or triggering the adw timeout]
: crasse (es) ; ICQ&
10549
: crasse (es) ; /u/smd/bin/ICQ: /usr/pkg/java/bin/java: not found
crasse# ls /usr/pkg/java
ls: /usr/pkg/java: Bad file descriptor
fsck -f on the raw partition is happy
fsdb -f on the raw partition says:
fsdb (inum: 2)> ls
...
slot 16 ino 97946 reclen 16: directory, `java'
...
fsdb (inum: 2)> cd java
component `java': current inode: directory
I=97946 MODE=40755 SIZE=512
MTIME=Feb 3 08:25:22 2000 [0 nsec]
CTIME=Mar 11 19:05:19 2000 [507351000 nsec]
ATIME=Jul 14 05:22:05 2000 [988834000 nsec]
OWNER=root GRP=wheel LINKCNT=6 FLAGS=0x0 BLKCNT=0x4 GEN=0x1
fsdb (inum: 97946)> ls
slot 0 ino 97946 reclen 12: directory, `.'
slot 1 ino 2 reclen 12: directory, `..'
slot 2 ino 103706 reclen 16: directory, `include'
slot 3 ino 126746 reclen 12: directory, `bin'
slot 4 ino 144026 reclen 12: directory, `lib'
slot 5 ino 207386 reclen 16: directory, `demo'
slot 6 ino 98006 reclen 20: regular, `COPYRIGHT'
slot 7 ino 98007 reclen 16: regular, `src.zip'
slot 8 ino 98008 reclen 16: regular, `LICENSE'
slot 9 ino 98009 reclen 16: regular, `README'
slot 10 ino 98010 reclen 24: regular, `README.NetBSD'
slot 11 ino 98011 reclen 16: regular, `CHANGES'
slot 12 ino 98012 reclen 324: regular, `index.html'
etc.
An unmount/mount or a mount -u -o reload at this point will fix things up,
but this is kinda gross and cancerous.
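A minimal sketch of checking for symptom (d) and arriving at the reload
workaround, for anyone trying to reproduce this. The PR does not say which
mount point holds /usr/pkg/java, so MNT below is an assumption, and the
helper names probe_dir and reload_cmd are hypothetical. The reload command
is only printed, not run, since it needs root and a real affected filesystem.

```shell
#!/bin/sh
# Sketch only. MNT is an assumed mount point; the PR does not name it.
MNT=${MNT:-/usr/pkg}

# probe_dir DIR: print "ok DIR" if ls can read DIR, else "bad DIR".
# A "bad" result for a directory that fsdb shows as intact on disk
# would match symptom (d) described above.
probe_dir() {
    if ls "$1" >/dev/null 2>&1; then
        echo "ok $1"
    else
        echo "bad $1"
    fi
}

# reload_cmd MNT: print (do not run) the reload workaround from the text.
reload_cmd() {
    echo "mount -u -o reload $1"
}
```

For example, run probe_dir "$MNT/java" after a timeout; if it reports
"bad" while fsdb still lists the directory, run as root the command that
reload_cmd "$MNT" prints.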
Note that I have seen something similar before, with an
isp(4) hang problem that mjacob fixed:
> Cool. I have one other bizarre symptom; I discovered which drive was OTL
> by firing up access on all the drives on the scsi chain, and of course the
> ones to sd2 would hang. However, one of the accesses was /bin/ls /usr
> (/usr is on sd2) and got back a command prompt and a successful exit value.
>How-To-Repeat:
boot -current
run for a while
go to sleep
wake up
run for a (short) while, eventually touching a disk on the adw(4)
observe timeout
>Fix:
In the pre-PR discussions with dante and thorpej, Jason sent me a
patch to deal with the Intel 82443BX (pchb) IPDLT (Idle/Pipeline
DRAM Leadoff Timing) flag being set wrong by some BIOSes; this had
no effect on the timeouts AFAICT.
No fix known yet.
>Release-Note:
>Audit-Trail:
>Unformatted: