Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Greg Oster <oster@cs.usask.ca>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: port-i386
Date: 04/04/2003 17:14:02
	Hello Greg et al.  I've done a bit more research on my "hanging"
problem with NetBSD/i386-1.6.1_RC2.  The symptom I'm seeing is that the
machine appears to hang.  The kernel is still running, as evidenced by the
fact that the machine is still pingable, but it is otherwise unresponsive.
This machine has a raid5 set of 3 disks, which comprises all of its
storage.  It's swapping to a stand-alone partition on one of the disks,
because swapping to a raid5 partition is known to cause similar hangs.
	This morning I was able to force a panic dump after a hang, and found
that one of the processes is stuck in flt_pmfail1 or flt_pmfail2 (I'm not
sure which), which /usr/src/sys/uvm/uvm_fault.c says means I'm out of
memory.  However, vmstat -s on the kernel crash file claims I have 221 free
pages and that there is no paging operation in progress.  In addition,
vmstat -m shows I'm using only 3MB of kernel memory, out of a possible 64MB
of kernel memory.
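
To put the free-page count in perspective, here is a quick
back-of-the-envelope conversion.  It assumes the usual 4KB page size on
i386; the 221 pages and the 128MB of RAM are the figures from this report,
and the variable names are only for illustration.

```python
# Rough conversion of the vmstat -s free-page count into bytes.
PAGE_SIZE = 4096                 # assumption: 4KB pages on i386
FREE_PAGES = 221                 # from vmstat -s on the crash dump
TOTAL_RAM = 128 * 1024 * 1024    # 128MB of memory in the machine

free_bytes = FREE_PAGES * PAGE_SIZE
free_kb = free_bytes // 1024
free_percent = 100.0 * free_bytes / TOTAL_RAM

print(f"{FREE_PAGES} free pages = {free_kb} KB "
      f"({free_percent:.2f}% of physical memory)")
```

Under that assumption the free list holds 884 KB, well under 1MB of the
128MB installed, so "not zero" does not necessarily mean "plenty".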
	Does anyone have any ideas on what resource I might be running out of?
I have 128MB of memory in the machine.  Alternatively, has anyone else seen
this problem and have they solved it?  I have the kernel core image and am
willing to try any commands anyone might want to suggest.  Or, if someone
wants the core file for examination, I'm happy to provide it.

Just for fun, here's what ps -lax has to say about the processes on the
system at the time of the crash.

-Brian
  UID PID        PPID CPU PRI NI  VSZ RSS WCHAN    STAT TT     TIME COMMAND
    0   0 -1077951064   0 -18  0    0   0 schedule DKs  ??  0:00.00 [swapper]
    0   1 -1077951064   0  10  0  348   0 wait     TWs  ??  0:00.00 init 
    0   2 -1077951064   0  -6  0    0   0 sccomp   DK   ??  0:00.00 [atapibus0]
    0   3 -1077951064   0  10  0    0   0 -        RK   ??  0:00.00 [usb0]
    0   4 -1077951064   0  10  0    0   0 usbtsk   DK   ??  0:00.00 [usbtask]
    0   5 -1077951064   0  10  0    0   0 -        RK   ??  0:00.00 [apm0]
    0   6 -1077951064   0  -6  0    0   0 -        RK   ??  0:00.00 [raid]
    0   7 -1077951064  36 -18  0    0   0 -        RK   ??  0:36.00 [pagedaemon
    0   8 -1077951064   0 -18  0    0   0 reaper   DK   ??  0:00.00 [reaper]
    0   9 -1077951064   0  18  0    0   0 -        RK   ??  0:00.00 [ioflush]
    0  10 -1077951064   0 -18  0    0   0 aiodoned DK   ??  0:00.00 [aiodoned]
    0  24 -1077951064  29  -6  0    0   0 rfwcond  DK   ??  0:29.00 [raid_parit
    0  85 -1077951064   0   2  0  256   0 -        Ts   ??  0:00.00 /usr/sbin/s
    0  96 -1077951064   0   2  0 3016   0 -        Ts   ??  0:00.00 /usr/sbin/n
    0 101 -1077951064   0   2  0  120   0 -        Ts   ??  0:00.00 (rpcbind)
    0 118 -1077951064   0 -22  0    0   0 actwat   DK   ??  0:00.00 [acctwatch]
    0 149 -1077951064   0   2  0 1040   0 select   TWs  ??  0:00.00 (dhcpd)
    0 158 -1077951064   0   2  0  968   0 -        Ts   ??  0:00.00 (httpd)
    0 166 -1077951064   0   2  0  984   0 -        Ts   ??  0:00.00 (nmbd)
    0 172 -1077951064  31   2  0 1848   0 select   TWs  ??  0:31.00 (smbd)
65533 174 -1077951064   0   2  0 1048   0 netcon   TW   ??  0:00.00 (httpd)
65533 175 -1077951064   0   2  0 1048   0 netcon   TW   ??  0:00.00 /usr/pkg/sb
65533 176 -1077951064   0   2  0 1048   0 netcon   TW   ??  0:00.00 /usr/pkg/sb
65533 177 -1077951064   0   2  0 1064   0 netcon   TW   ??  0:00.00 /usr/pkg/sb
65533 178 -1077951064   0   2  0 1064   0 netcon   TW   ??  0:00.00 /usr/pkg/sb
   25 180 -1077951064  27   2  0  696   0 netcon   TW   ??  0:27.00 (websterd)
   25 197 -1077951064   0  10  0 2304   0 -        T    ??  0:00.00 /usr/pkg/sb
65533 201 -1077951064   0   2  0 1064   0 netcon   TW   ??  0:00.00 (httpd)
   25 202 -1077951064  24   2  0 2296   0 netcon   TW   ??  0:24.00 /usr/local/
    7 205 -1077951064  31   2  0 1356   0 select   TWs  ??  0:31.00 (postgres)
    0 210 -1077951064  31   2  0  112   0 select   TWs  ??  0:31.00 (lpd)
    0 233 -1077951064   7   2  0  388   0 select   TWs  ??  0:07.00 (sshd)
    0 236 -1077951064   0   2  0  720   0 -        Ts   ??  0:00.00 (sendmail)
    0 245 -1077951064   0   2  0  212   0 -        Ts   ??  0:00.00 (inetd)
    0 250 -1077951064   0  10  0  248   0 -        Ts   ??  0:00.00 /usr/sbin/i
    0 260 -1077951064   0   2  0  168   0 -        T    ??  0:00.00 (telnetd)
65533 331 -1077951064   0   2  0 1064   0 netcon   TW   ??  0:00.00 (httpd)
65533 332 -1077951064   0   2  0 1052   0 netcon   TW   ??  0:00.00 (httpd)
65533 333 -1077951064   0   2  0 1052   0 netcon   TW   ??  0:00.00 (httpd)
    0 508 -1077951064   0   2  0   32   0 -        T    ??  0:00.00 (comsat)
    0 588 -1077951064   0   2  0  764   0 -        T    ??  0:00.00 (sendmail)
    0 874 -1077951064   0  -6  0  248   0 piperd   T    ??  0:00.00 (cron)
    0 875 -1077951064   0  -6  0  248   0 piperd   T    ??  0:00.00 (cron)
    0 876 -1077951064   0  -6  0  248   0 piperd   T    ??  0:00.00 (cron)
    0 877 -1077951064   0  -6  0  248   0 piperd   T    ??  0:00.00 (cron)
  100 880 -1077951064   0  10  0  480   0 wait     Ts   ??  0:00.00 (sh)
  100 881 -1077951064   0  10  0  480   0 wait     Ts   ??  0:00.00 (sh)
  100 882 -1077951064   0  10  0  480   0 wait     Ts   ??  0:00.00 (sh)
    0 883 -1077951064   0  -6  0  248   0 piperd   T    ??  0:00.00 (cron)
  100 886 -1077951064   0  10  0  480   0 wait     Ts   ??  0:00.00 (sh)
  100 887 -1077951064   0  10  0  480   0 wait     Ts   ??  0:00.00 (sh)
  100 895 -1077951064   0  10  0  480   0 wait     T    ??  0:00.00 (sh)
  100 897 -1077951064   0  10  0  488   0 wait     T    ??  0:00.00 (sh)
  100 898 -1077951064   0  10  0  488   0 wait     T    ??  0:00.00 (sh)
  100 899 -1077951064   0  10  0  488   0 wait     T    ??  0:00.00 (sh)
  100 905 -1077951064   7  10  0  488   0 wait     T    ??  0:07.00 (sh)
  100 933 -1077951064   8  -6  0  808   0 piperd   T    ??  0:08.00 (expect)
  100 935 -1077951064   0   2  0 1076   0 -        T    ??  0:00.00 (expect)
  100 942 -1077951064   0   2  0 1076   0 -        T    ??  0:00.00 (expect)
  100 943 -1077951064   0   2  0 1076   0 -        T    ??  0:00.00 (expect)
  100 954 -1077951064   0  -1  0  316   0 -        T    ??  0:00.00 (awk)
  100 958 -1077951064  36 -18  0  800   0 flt_pmfa TL   ??  0:36.00 (expect)
  100 262 -1077951064   0  18  0  484   0 pause    TWs  p0  0:00.00 (csh)
  100 276 -1077951064   0  18  0  168   0 -        T    p0  0:00.00 (monitor)
  100 278 -1077951064   0   2  0 1708   0 -        T+   p0  0:00.00 (window)
  100 341 -1077951064   0  28  0  268   0 -        TW   p0  0:00.00 (telnet)
  100 279 -1077951064   0  18  0  484   0 pause    TWs  p1  0:00.00 (csh)
  100 868 -1077951064   0   3  0  148   0 ttyin    TW+  p1  0:00.00 (more)
  100 280 -1077951064   0   3  0  476   0 ttyin    TWs+ p2  0:00.00 (csh)
  100 281 -1077951064   0   3  0  476   0 ttyin    TWs+ p3  0:00.00 (csh)
  100 282 -1077951064   0   3  0  476   0 ttyin    TWs+ p4  0:00.00 (csh)
  100 283 -1077951064   0   3  0  476   0 ttyin    TWs+ p5  0:00.00 (csh)
  100 284 -1077951064   1   3  0  476   0 ttyin    TWs+ p6  0:01.00 (csh)
  100 285 -1077951064   0   3  0  476   0 ttyin    TWs+ p7  0:00.00 (csh)
  100 286 -1077951064   0   3  0  476   0 ttyin    TWs+ p8  0:00.00 (csh)
  100 287 -1077951064   0   3  0  476   0 ttyin    TWs+ p9  0:00.00 (csh)
  100 944 -1077951064   0   2  0  252   0 -        Ts+  pa  0:00.00 (telnet)
  100 946 -1077951064   0   2  0  252   0 -        Ts+  pb  0:00.00 (telnet)
  100 948 -1077951064   0   2  0  252   0 -        Ts+  pc  0:00.00 (telnet)
    0  17 -1077951064   0  10  0  572   0 wait     TW   00- 0:00.00 (sh)
    0  19 -1077951064   0  10  0  572   0 wait     TW   00- 0:00.00 (sh)
    0  23 -1077951064   0  10  0  172   0 -        T    00- 0:00.00 (raidctl)
    0 253 -1077951064   0   3  0   48   0 -        Ts+  00  0:00.00 (getty)
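
If anyone wants to sift through the listing above mechanically, here is a
rough sketch of how it could be filtered.  It assumes the 13-column layout
shown above, and SAMPLE contains only a few rows copied from the listing;
the helper names (parse_ps, stuck_in_fault) are made up for this example.

```python
# Sketch: pick out the interesting rows from a saved "ps -lax" listing.
# SAMPLE is a small excerpt of the listing; a real run would read the
# whole saved output from a file instead.
SAMPLE = """\
  UID PID        PPID CPU PRI NI  VSZ RSS WCHAN    STAT TT     TIME COMMAND
    0   7 -1077951064  36 -18  0    0   0 -        RK   ??  0:36.00 [pagedaemon
    0   8 -1077951064   0 -18  0    0   0 reaper   DK   ??  0:00.00 [reaper]
  100 954 -1077951064   0  -1  0  316   0 -        T    ??  0:00.00 (awk)
  100 958 -1077951064  36 -18  0  800   0 flt_pmfa TL   ??  0:36.00 (expect)
"""


def parse_ps(text):
    """Turn the listing into a list of dicts keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # Split into at most len(header) fields so COMMAND stays in one piece.
        fields = ln.strip().split(None, len(header) - 1)
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def stuck_in_fault(rows, prefix="flt_pm"):
    """Rows whose wait channel starts with the (truncated) flt_pmfail name."""
    return [r for r in rows if r["WCHAN"].startswith(prefix)]


rows = parse_ps(SAMPLE)
busy = [r for r in rows if int(r["CPU"]) > 0]

for r in stuck_in_fault(rows):
    print("stuck:", r["PID"], r["WCHAN"], r["COMMAND"])
for r in busy:
    print("busy: ", r["PID"], r["CPU"], r["COMMAND"])
```

On the excerpt this flags pid 958 (expect) as the process sitting in
flt_pmfa, and pids 7 (pagedaemon) and 958 as the ones with a non-zero CPU
column.  Note that ps truncates the wait channel to 8 characters, which is
why flt_pmfail1 and flt_pmfail2 cannot be told apart from this output.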