NetBSD-Bugs archive
kern/38019: some kind of undetected deadlock slowly kills NetBSD-4.0_STABLE GENERIC.MP
>Number: 38019
>Category: kern
>Synopsis: some kind of undetected deadlock slowly kills NetBSD-4.0_STABLE GENERIC.MP
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Feb 13 17:40:01 +0000 2008
>Originator: Greg A. Woods
>Release: NetBSD 4.0_STABLE 2008/02/10
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD 4.0_STABLE GENERIC.MP
Architecture: i386
Machine: i386
>Description:
I've been experiencing regular hangs and unkillable processes on
my Dell PE2650 running NetBSD-4.0_STABLE GENERIC.MP.
The most regular problem is triggered during the big "find" runs
invoked by /etc/daily et al.
The system managed to make it through its nightly cron jobs
without hanging last night and I managed to use it (lightly) for
several hours this morning before problems began to appear. I
had been investigating problems with the apcupsd package and had
unpacked it and built it once or twice, then suddenly the
"extract" phase hung, but the make process was interruptible, so
I tried it again, with the same result. Soon I discovered the
gzcat and tar processes from both attempts were still present,
and they were unkillable.
Interestingly it doesn't seem to be access to the file gzcat is
reading which causes problems. Clearly the second run of
"digest" didn't hang, and manual access with 'cat' and 'dd'
works without hanging too.
When the nightly "find" is the one that locks up, it seems all
access to the same device (and filesystem?) soon causes all
(useful) processes to lock up.
The common denominator with the oft-nightly hangs and this
currently more minor hang is that some of the stuck processes
are in "vmmapva" and they are unkillable.
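As a rough aid for spotting this condition, the sketch below filters saved "ps -la" output for processes in uninterruptible sleep (STAT beginning with "D") on a given wait channel. It assumes the column order shown in the "ps -la" listing in this report; the function name stuck_processes is made up for illustration and is not part of any NetBSD tool.

```python
# Sketch: find D-state processes on a given wait channel in `ps -la` text.
# Assumed column order (as in this report's listing):
# UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND

def stuck_processes(ps_text, wchan="vmmapva"):
    """Return (pid, wchan, command) tuples for D-state processes."""
    found = []
    for line in ps_text.splitlines():
        # Split into at most 13 fields so COMMAND keeps its spaces.
        fields = line.split(None, 12)
        # Skip the header, wrapped continuation lines, and short lines.
        if len(fields) < 12 or not fields[1].isdigit():
            continue
        pid, chan, stat = int(fields[1]), fields[8], fields[9]
        # COMMAND may be missing if the line was wrapped by the terminal.
        command = fields[12] if len(fields) > 12 else ""
        if stat.startswith("D") and chan == wchan:
            found.append((pid, chan, command))
    return found
```

Feeding it the text of a saved "ps -la" run and printing the result gives a quick list of candidates to inspect further with fstat(1).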
This may, or may not, be related to the same causes as the
problem in my PR#37993.
One difference between this machine and others I know are
running very similar kernels is that this machine's local
filesystems are all accessed via the built-in Dell PERC/3Di
controller and the aac(4) (and ld(4)) drivers. The aac(4)
driver is known to be rather buggy, the worst problem of which
is that it seems to miss some interrupts, but a regular job
reading a block from the raw /dev/ld0d device seems to wake it
up enough to keep things running normally. See the 'sh' process
running 'dd' in a loop in the info below.
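The idea behind that workaround can be sketched as follows. This is only an illustration and not the loop actually running on this machine: the 512-byte block size, the one-second interval, and the function names are all assumptions.

```python
import os
import time

def read_one_block(path, block_size=512):
    """Read a single block from the start of `path`; return bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return len(os.read(fd, block_size))
    finally:
        os.close(fd)

def poke_device(path, interval=1.0, count=None, block_size=512):
    """Read one block every `interval` seconds.

    Runs `count` times, or forever if `count` is None.
    Returns the number of reads performed.
    """
    done = 0
    while count is None or done < count:
        read_one_block(path, block_size)
        done += 1
        # Do not sleep after the final read of a bounded run.
        if count is None or done < count:
            time.sleep(interval)
    return done

# Hypothetical use (as root): poke_device("/dev/rld0d")
```

Reading from the raw device rather than the block device means each read is passed through to the driver instead of being served from the buffer cache, which is the point of the exercise.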
I'll try to capture a crash dump from the system as I reboot it.
I'll also try to find and back-port some of the aac(4) fixes
that were discussed some time ago when I first noted that the
driver is buggy and is also way out of date w.r.t. its original
sources in FreeBSD.
12:07 [2291] # df
Filesystem 512-blocks Used Avail %Cap Mounted on
/dev/ld0a 4032824 2294492 1536692 59% /
/dev/ld0e 10089944 2127724 7457724 22% /var
/dev/ld0f 16130008 10325496 4998012 67% /usr/pkg
/dev/ld0g 231522520 131320032 88626364 59% /rest
mfs:413 2024220 142308 1780704 7% /tmp
kernfs 2 2 0 100% /kern
most:/var/spool/ftp/pub/mirror 8749688 1896624 6415576 22% /var/package-distfiles
12:08 [2292] # mount
/dev/ld0a on / type ffs (local)
/dev/ld0e on /var type ffs (nosuid, nodev, NFS exported, local)
/dev/ld0f on /usr/pkg type ffs (nodev, soft dependencies, local)
/dev/ld0g on /rest type ffs (nosuid, nodev, NFS exported, local)
mfs:413 on /tmp type mfs (synchronous, nosuid, nodev, local)
kernfs on /kern type kernfs (local)
most:/var/spool/ftp/pub/mirror on /var/package-distfiles type nfs (nosuid, nodev)
12:08 [2293] # ps -la
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1000 12225 10492 0 3 0 604 4 ttyin IWs+ ttyp0 0:00.48 -ksh
1000 3359 3354 0 18 0 568 4 pause IWs ttyp1 0:00.09 -ksh
0 11549 3359 0 3 0 604 588 ttyin I+ ttyp1 0:00.12 ksh
1000 11156 10187 91 3 0 572 584 ttyin Is+ ttyp2 0:00.28 -ksh
1000 10696 10694 0 3 0 580 592 ttyin Is+ ttyp3 0:00.20 -ksh
0 7610 17451 0 -18 0 424 368 vmmapva D ttyp4 0:00.74 /usr/bin/gzcat /var/package-distfiles//apcupsd-3.
0 10067 27688 0 28 0 364 324 - R+ ttyp4 0:00.00 ps -la
1000 12435 29241 1746 18 0 568 4 pause IWs ttyp4 0:00.12 -ksh
0 15005 27796 0 -18 0 424 368 vmmapva D ttyp4 0:00.83 /usr/bin/gzcat /var/package-distfiles//apcupsd-3.
0 15626 27796 0 2 0 424 360 pipecl D ttyp4 0:00.01 /bin/tar -xf -
0 17451 1 0 10 0 472 436 wait I ttyp4 0:00.00 (sh)
0 17645 17451 0 2 0 424 360 pipecl D ttyp4 0:00.00 /bin/tar -xf -
0 27688 12435 0 18 0 684 704 pause S ttyp4 0:00.62 ksh
0 27796 1 0 10 0 472 436 wait I ttyp4 0:00.00 (sh)
1000 532 2485 1258 18 0 568 4 pause IWs ttyp5 0:00.13 -ksh
1000 7340 532 0 2 0 1512 1072 select I+ ttyp5 0:01.18 slogin whome.planix.com (ssh2)
1000 22241 12282 0 3 0 564 560 ttyin Is+ ttyp6 0:00.31 -ksh
1000 8897 6756 0 18 0 568 572 pause Is ttyp7 0:00.12 -ksh
0 27674 8897 0 3 0 604 624 ttyin I+ ttyp7 0:00.08 ksh
0 350 1 0 10 0 456 364 wait S tty00- 1:37.04 sh -c while : ; do dd if=/dev/rld0d of=/dev/null
0 2098 1 0 3 0 256 4 ttyin IWs+ tty00 0:00.07 /usr/libexec/getty default constty
0 1582 1 1348 3 0 256 4 ttyin IWs+ ttyE0 0:00.07 /usr/libexec/getty Pc ttyE0
0 1511 1 0 3 0 436 4 ttyin IWs+ ttyE1 0:00.16 -ksh
0 2103 1 1348 3 0 256 4 ttyin IWs+ ttyE2 0:00.28 /usr/libexec/getty Pc ttyE2
0 2066 1 1348 3 0 256 4 ttyin IWs+ ttyE3 0:00.08 /usr/libexec/getty Pc ttyE3
0 2041 1 1348 3 0 256 4 ttyin IWs+ ttyE4 0:00.08 /usr/libexec/getty Pc ttyE4
0 1980 1 1348 3 0 256 4 ttyin IWs+ ttyE5 0:00.08 /usr/libexec/getty Pc ttyE5
0 1981 1 1348 3 0 256 4 ttyin IWs+ ttyE6 0:00.15 /usr/libexec/getty Pc ttyE6
12:09 [2294] # top
12:10 [2295] # fstat -p 7610
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
root gzcat 7610 wd /rest 9070540 drwxr-xr-x 512 r
root gzcat 7610 0 / 162324 crw------- ttyp4 rw
root gzcat 7610 1* pipe 0xd7b282bc -> 0xd7b284b0 w
root gzcat 7610 2 / 162324 crw------- ttyp4 rw
root gzcat 7610 3 /var/package-distfiles 121 -rw-r--r-- 4356614 r
root gzcat 7610 5 - - none -
12:16 [2296] # fstat -p 15005
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
root gzcat 15005 wd /rest 9070539 drwxr-xr-x 0 r
root gzcat 15005 0 / 162324 crw------- ttyp4 rw
root gzcat 15005 1* pipe 0xd7b28258 -> 0xd7b28064 w
root gzcat 15005 2 / 162324 crw------- ttyp4 rw
root gzcat 15005 3 /var/package-distfiles 121 -rw-r--r-- 4356614 r
root gzcat 15005 5 - - none -
12:16 [2297] # kill -9 15626 7610 15005 17645
12:18 [2298] # kill -9 15626 7610 15005 17645
12:18 [2399] # kill -9 15626 7610 15005 17645
12:18 [2300] # fstat -p 15626
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
root tar 15626 wd /rest 9070539 drwxr-xr-x 0 r
root tar 15626 1 / 162324 crw------- ttyp4 rw
root tar 15626 2 / 162324 crw------- ttyp4 rw
root tar 15626 3 / 160798 crw-rw-rw- tty rw
root tar 15626 4 /rest 9070539 drwxr-xr-x 0 r
root tar 15626 5 - - none -
12:20 [2301] # fstat -p 17645
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
root tar 17645 wd /rest 9070540 drwxr-xr-x 512 r
root tar 17645 1 / 162324 crw------- ttyp4 rw
root tar 17645 2 / 162324 crw------- ttyp4 rw
root tar 17645 3 / 160798 crw-rw-rw- tty rw
root tar 17645 4 /rest 9070540 drwxr-xr-x 512 r
root tar 17645 5 - - none -
12:18 [2302] #
>How-To-Repeat:
>Fix: