Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Greg Oster <oster@cs.usask.ca>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: port-i386
Date: 04/04/2003 17:14:02
Hello Greg et al. I've done a bit more research on my "hanging"
problem with NetBSD/i386-1.6.1_RC2. The symptom I'm seeing is that the
machine appears to hang. The kernel is still running, since the machine
still answers pings, but it is otherwise unresponsive.
This machine has a raid5 set of 3 disks, which comprise all of its
storage. It swaps to a stand-alone partition on one of the disks, because
swapping to a raid5 partition is known to cause similar hangs.
This morning I was able to force a panic dump after a hang, and found
that one of the processes is stuck in flt_pmfail1 or flt_pmfail2 (I'm not
sure which). According to usr/src/sys/uvm/uvm_fault.c, that wait channel
means I'm out of memory. However, vmstat -s on the kernel crash file
claims I have 221 free pages and that there is no paging operation in
progress. In addition, vmstat -m shows I'm using only 3MB of kernel
memory, out of a possible 64MB.
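As a rough sanity check on those numbers, the free-page count can be converted to bytes and compared with the 128MB of RAM in the machine. The sketch below assumes the usual 4096-byte i386 page size, which the post itself does not state. It says nothing about where the kernel's low-memory thresholds sit, so it cannot tell whether 221 pages counts as "out of memory" from the fault handler's point of view.

```python
# Assumption: 4096-byte pages, the usual i386 page size (not stated in the post).
PAGE_SIZE = 4096
FREE_PAGES = 221   # free pages reported by vmstat -s on the crash dump
RAM_MIB = 128      # physical memory in the machine, per the post


def pages_to_kib(pages, page_size=PAGE_SIZE):
    """Convert a page count to KiB."""
    return pages * page_size // 1024


# Total number of physical pages if all 128 MiB were managed as 4 KiB pages.
total_pages = RAM_MIB * 1024 * 1024 // PAGE_SIZE
free_kib = pages_to_kib(FREE_PAGES)
free_fraction = FREE_PAGES / total_pages

print(f"{free_kib} KiB free, {free_fraction:.2%} of {total_pages} pages")
```

Under that assumption the free list holds well under 1MB, a small fraction of physical memory.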
Does anyone have any ideas on what resource I might be running out of?
I have 128MB of memory in the machine. Alternatively, has anyone else seen
this problem and have they solved it? I have the kernel core image and am
willing to try any commands anyone might want to suggest. Or, if someone
wants the core file for examination, I'm happy to provide it.
Just for fun, here's what ps -lax has to say about the processes on the
system at the time of the crash.
-Brian
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
0 0 -1077951064 0 -18 0 0 0 schedule DKs ?? 0:00.00 [swapper]
0 1 -1077951064 0 10 0 348 0 wait TWs ?? 0:00.00 init
0 2 -1077951064 0 -6 0 0 0 sccomp DK ?? 0:00.00 [atapibus0]
0 3 -1077951064 0 10 0 0 0 - RK ?? 0:00.00 [usb0]
0 4 -1077951064 0 10 0 0 0 usbtsk DK ?? 0:00.00 [usbtask]
0 5 -1077951064 0 10 0 0 0 - RK ?? 0:00.00 [apm0]
0 6 -1077951064 0 -6 0 0 0 - RK ?? 0:00.00 [raid]
0 7 -1077951064 36 -18 0 0 0 - RK ?? 0:36.00 [pagedaemon
0 8 -1077951064 0 -18 0 0 0 reaper DK ?? 0:00.00 [reaper]
0 9 -1077951064 0 18 0 0 0 - RK ?? 0:00.00 [ioflush]
0 10 -1077951064 0 -18 0 0 0 aiodoned DK ?? 0:00.00 [aiodoned]
0 24 -1077951064 29 -6 0 0 0 rfwcond DK ?? 0:29.00 [raid_parit
0 85 -1077951064 0 2 0 256 0 - Ts ?? 0:00.00 /usr/sbin/s
0 96 -1077951064 0 2 0 3016 0 - Ts ?? 0:00.00 /usr/sbin/n
0 101 -1077951064 0 2 0 120 0 - Ts ?? 0:00.00 (rpcbind)
0 118 -1077951064 0 -22 0 0 0 actwat DK ?? 0:00.00 [acctwatch]
0 149 -1077951064 0 2 0 1040 0 select TWs ?? 0:00.00 (dhcpd)
0 158 -1077951064 0 2 0 968 0 - Ts ?? 0:00.00 (httpd)
0 166 -1077951064 0 2 0 984 0 - Ts ?? 0:00.00 (nmbd)
0 172 -1077951064 31 2 0 1848 0 select TWs ?? 0:31.00 (smbd)
65533 174 -1077951064 0 2 0 1048 0 netcon TW ?? 0:00.00 (httpd)
65533 175 -1077951064 0 2 0 1048 0 netcon TW ?? 0:00.00 /usr/pkg/sb
65533 176 -1077951064 0 2 0 1048 0 netcon TW ?? 0:00.00 /usr/pkg/sb
65533 177 -1077951064 0 2 0 1064 0 netcon TW ?? 0:00.00 /usr/pkg/sb
65533 178 -1077951064 0 2 0 1064 0 netcon TW ?? 0:00.00 /usr/pkg/sb
25 180 -1077951064 27 2 0 696 0 netcon TW ?? 0:27.00 (websterd)
25 197 -1077951064 0 10 0 2304 0 - T ?? 0:00.00 /usr/pkg/sb
65533 201 -1077951064 0 2 0 1064 0 netcon TW ?? 0:00.00 (httpd)
25 202 -1077951064 24 2 0 2296 0 netcon TW ?? 0:24.00 /usr/local/
7 205 -1077951064 31 2 0 1356 0 select TWs ?? 0:31.00 (postgres)
0 210 -1077951064 31 2 0 112 0 select TWs ?? 0:31.00 (lpd)
0 233 -1077951064 7 2 0 388 0 select TWs ?? 0:07.00 (sshd)
0 236 -1077951064 0 2 0 720 0 - Ts ?? 0:00.00 (sendmail)
0 245 -1077951064 0 2 0 212 0 - Ts ?? 0:00.00 (inetd)
0 250 -1077951064 0 10 0 248 0 - Ts ?? 0:00.00 /usr/sbin/i
0 260 -1077951064 0 2 0 168 0 - T ?? 0:00.00 (telnetd)
65533 331 -1077951064 0 2 0 1064 0 netcon TW ?? 0:00.00 (httpd)
65533 332 -1077951064 0 2 0 1052 0 netcon TW ?? 0:00.00 (httpd)
65533 333 -1077951064 0 2 0 1052 0 netcon TW ?? 0:00.00 (httpd)
0 508 -1077951064 0 2 0 32 0 - T ?? 0:00.00 (comsat)
0 588 -1077951064 0 2 0 764 0 - T ?? 0:00.00 (sendmail)
0 874 -1077951064 0 -6 0 248 0 piperd T ?? 0:00.00 (cron)
0 875 -1077951064 0 -6 0 248 0 piperd T ?? 0:00.00 (cron)
0 876 -1077951064 0 -6 0 248 0 piperd T ?? 0:00.00 (cron)
0 877 -1077951064 0 -6 0 248 0 piperd T ?? 0:00.00 (cron)
100 880 -1077951064 0 10 0 480 0 wait Ts ?? 0:00.00 (sh)
100 881 -1077951064 0 10 0 480 0 wait Ts ?? 0:00.00 (sh)
100 882 -1077951064 0 10 0 480 0 wait Ts ?? 0:00.00 (sh)
0 883 -1077951064 0 -6 0 248 0 piperd T ?? 0:00.00 (cron)
100 886 -1077951064 0 10 0 480 0 wait Ts ?? 0:00.00 (sh)
100 887 -1077951064 0 10 0 480 0 wait Ts ?? 0:00.00 (sh)
100 895 -1077951064 0 10 0 480 0 wait T ?? 0:00.00 (sh)
100 897 -1077951064 0 10 0 488 0 wait T ?? 0:00.00 (sh)
100 898 -1077951064 0 10 0 488 0 wait T ?? 0:00.00 (sh)
100 899 -1077951064 0 10 0 488 0 wait T ?? 0:00.00 (sh)
100 905 -1077951064 7 10 0 488 0 wait T ?? 0:07.00 (sh)
100 933 -1077951064 8 -6 0 808 0 piperd T ?? 0:08.00 (expect)
100 935 -1077951064 0 2 0 1076 0 - T ?? 0:00.00 (expect)
100 942 -1077951064 0 2 0 1076 0 - T ?? 0:00.00 (expect)
100 943 -1077951064 0 2 0 1076 0 - T ?? 0:00.00 (expect)
100 954 -1077951064 0 -1 0 316 0 - T ?? 0:00.00 (awk)
100 958 -1077951064 36 -18 0 800 0 flt_pmfa TL ?? 0:36.00 (expect)
100 262 -1077951064 0 18 0 484 0 pause TWs p0 0:00.00 (csh)
100 276 -1077951064 0 18 0 168 0 - T p0 0:00.00 (monitor)
100 278 -1077951064 0 2 0 1708 0 - T+ p0 0:00.00 (window)
100 341 -1077951064 0 28 0 268 0 - TW p0 0:00.00 (telnet)
100 279 -1077951064 0 18 0 484 0 pause TWs p1 0:00.00 (csh)
100 868 -1077951064 0 3 0 148 0 ttyin TW+ p1 0:00.00 (more)
100 280 -1077951064 0 3 0 476 0 ttyin TWs+ p2 0:00.00 (csh)
100 281 -1077951064 0 3 0 476 0 ttyin TWs+ p3 0:00.00 (csh)
100 282 -1077951064 0 3 0 476 0 ttyin TWs+ p4 0:00.00 (csh)
100 283 -1077951064 0 3 0 476 0 ttyin TWs+ p5 0:00.00 (csh)
100 284 -1077951064 1 3 0 476 0 ttyin TWs+ p6 0:01.00 (csh)
100 285 -1077951064 0 3 0 476 0 ttyin TWs+ p7 0:00.00 (csh)
100 286 -1077951064 0 3 0 476 0 ttyin TWs+ p8 0:00.00 (csh)
100 287 -1077951064 0 3 0 476 0 ttyin TWs+ p9 0:00.00 (csh)
100 944 -1077951064 0 2 0 252 0 - Ts+ pa 0:00.00 (telnet)
100 946 -1077951064 0 2 0 252 0 - Ts+ pb 0:00.00 (telnet)
100 948 -1077951064 0 2 0 252 0 - Ts+ pc 0:00.00 (telnet)
0 17 -1077951064 0 10 0 572 0 wait TW 00- 0:00.00 (sh)
0 19 -1077951064 0 10 0 572 0 wait TW 00- 0:00.00 (sh)
0 23 -1077951064 0 10 0 172 0 - T 00- 0:00.00 (raidctl)
0 253 -1077951064 0 3 0 48 0 - Ts+ 00 0:00.00 (getty)
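
A listing like the one above can be scanned mechanically for processes sleeping on a given wait channel. The following is a minimal sketch that assumes the rows are whitespace-separated in the column order shown in the header. The function names and the `SAMPLE` rows (copied from the listing) are only for illustration.

```python
from collections import Counter

# Column order taken from the header line of the listing.
COLUMNS = "UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND".split()


def parse_ps_line(line):
    """Split one row into a dict keyed by column name, or None if malformed."""
    # Limit the split so a COMMAND containing spaces stays in one field.
    fields = line.split(None, len(COLUMNS) - 1)
    if len(fields) != len(COLUMNS) or fields[0] == "UID":
        return None  # short row or the header itself
    return dict(zip(COLUMNS, fields))


def wchan_counts(lines):
    """Count how many processes are sleeping on each wait channel."""
    rows = (parse_ps_line(line) for line in lines)
    return Counter(row["WCHAN"] for row in rows if row is not None)


def find_by_wchan(lines, prefix):
    """Return the rows whose wait channel starts with the given prefix."""
    rows = (parse_ps_line(line) for line in lines)
    return [r for r in rows if r is not None and r["WCHAN"].startswith(prefix)]


# A few rows copied from the listing, used here as sample input.
SAMPLE = [
    "UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND",
    "0 7 -1077951064 36 -18 0 0 0 - RK ?? 0:36.00 [pagedaemon",
    "0 874 -1077951064 0 -6 0 248 0 piperd T ?? 0:00.00 (cron)",
    "100 954 -1077951064 0 -1 0 316 0 - T ?? 0:00.00 (awk)",
    "100 958 -1077951064 36 -18 0 800 0 flt_pmfa TL ?? 0:36.00 (expect)",
]

for row in find_by_wchan(SAMPLE, "flt_pm"):
    print(row["PID"], row["WCHAN"], row["COMMAND"])
```

Matching on a prefix rather than the full name is deliberate, since the WCHAN column in the listing is truncated to eight characters.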