Subject: Re: netbsd 4.0 beta2 crashes when swapping
To: Pavel Cahyna <pavel@netbsd.org>
From: Todd Kover <kovert@omniscient.com>
List: port-i386
Date: 05/06/2007 19:01:50
 > > Most of the time I was running X, so no debugger, but the one time
 > > I was not, I had a kernel message:
 > >
 > >    uvm_fault(0xc0a534c, 0xf0000, 2) -> 0xe
 >
 > Is this all? 

At the time, that was all I got, and the machine was wedged.  Since then,
I've occasionally been able to get it to kick into ddb and get a dump
(I trigger it by gzcat'ing a crash dump, actually, so no X), but the kernel
was not built with -g / DEBUG, so the dump is not terribly useful.

 > i think it should print also something like
 > kernel: page fault trap, code=0 
 > Stopped at netbsd:<function name>+<address>: <assembly code>

The most recent time I forced a crash and got into ddb, I got:

uvm_fault(0xc09ff800, 0xf000, 1) -> 0xe
kernel: supervisor trap page fault, code=0
Stopped in pid 17.1 (pagedaemon) at 0xffff: invalid address
db>
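
For anyone reading along, here is a rough sketch of how I read the
uvm_fault line.  It assumes the message layout is
uvm_fault(map, va, access type) -> error, that the access type uses the
VM_PROT_* bits (1 = read, 2 = write, 4 = execute), and that the error is
a plain errno value.  The function name decode_uvm_fault is just
something I made up for illustration, and it leans on the host's errno
table, where 14 is EFAULT just as it is on NetBSD.

```python
import errno
import re

# VM_PROT_* bits; assumed to be what the third argument encodes.
VM_PROT = {1: "read", 2: "write", 4: "execute"}

# Assumed layout: uvm_fault(<map>, <va>, <access type>) -> <error>
UVM_FAULT_RE = re.compile(
    r"uvm_fault\((0x[0-9a-fA-F]+),\s*(0x[0-9a-fA-F]+),\s*(\d+)\)"
    r"\s*->\s*(0x[0-9a-fA-F]+)"
)


def decode_uvm_fault(line):
    """Return the decoded fields of a uvm_fault line, or None if no match."""
    m = UVM_FAULT_RE.search(line)
    if m is None:
        return None
    # int(s, 0) accepts both the 0x-prefixed and the decimal fields.
    map_addr, va, ftype, err = (int(g, 0) for g in m.groups())
    access = [name for bit, name in VM_PROT.items() if ftype & bit]
    return {
        "map": map_addr,
        "va": va,
        "access": "|".join(access) or "none",
        # Falls back to the raw number if the host has no name for it.
        "errno": errno.errorcode.get(err, str(err)),
    }


if __name__ == "__main__":
    print(decode_uvm_fault("uvm_fault(0xc09ff800, 0xf000, 1) -> 0xe"))
```

Read that way, the line above is a read fault at 0xf000 that failed with
EFAULT, and the earlier one was a write fault at 0xf0000 with the same
error.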

A bt or trace gave me a DDB Fault and no useful information.  In gdb, I got:

(gdb) target kvm netbsd.3.core
#0  0xc04d8eb6 in cpu_reboot ()
(gdb) where
#0  0xc04d8eb6 in cpu_reboot ()
#1  0xc01bdddf in db_sync_cmd ()
#2  0xc01be2d0 in db_command ()
#3  0xc01be688 in db_command_loop ()
#4  0xc01c1481 in db_trap ()
#5  0xc04d500f in kdb_trap ()
#6  0xc04e1c5b in trap ()
#7  0xc010b0c3 in calltrap ()
#8  0xc0430010 in sysctl_proc_stop ()
#9  0xc044abc5 in pool_drain ()
#10 0xc03ec98f in uvm_pageout ()
#11 0xc01002e1 in proc_trampoline ()

which does not seem very useful, but maybe it is to someone more
enlightened than I.
 
 > The last line could be interesting, especially if you have a netbsd.gdb
 > file (created by compiling with makeoptions DEBUG="-g").

I have yet to be able to get a kernel compiled with these to get into
the ddb (and a fsck takes ~ 30 minutes if I can't sync the disks, so
this is kind of slow going).

I was able to go back to a 4.0_beta2 kernel from Apr 16, 2007 that did
not exhibit this behavior, and tried a completely fresh beta2 build
from yesterday (removing my obj and dest dirs) and still got the crash
to happen.  Both had my pullup of PR/35970, so I've ruled that out.

The kernel that does not exhibit the behavior:

NetBSD slivovice.cz.omniscient.com 4.0_BETA2 NetBSD 4.0_BETA2 (GENERIC) #14: Mon Apr 16 01:28:59 EDT 2007  kovert@saidin.omniscient.com:/usr/obj/4.0-stable/i386/omniscient/os/NetBSD-4.0-branch/src/sys/arch/i386/compile/GENERIC i386

(the cvs update for the branch would have been earlier that day, perhaps
as much as 18 hours earlier).

Getting cvs to actually show a diff within a branch by date has been
surprisingly painful with cvs 1.11, so I'm still trying to track down
what might have changed.

From looking at the working kernel, what happens when things work is
that the buffer cache consumes available memory and the free memory (as
reported by top) hovers around 0 (300ishk through 1500ishk).  With
a problematic kernel, as soon as free memory gets down around there, I
get either a wedge or, occasionally, kicked into ddb with the above error.

I'm also trying to build 4.0_beta2 on the machine that the functioning
kernel was built on rather than this box, to try to rule out something
weird here, since it seems weird that I'm the only one seeing this.

I also continue to try to get a kernel built with -g to kick me into
ddb...

Dekuju,
-Todd