Subject: Re: NetBSD 1.5.1 GENERIC Kernel Crash with /tmp on mfs
To: Jenkins, Graham K [IBM GSA] <Graham.K.Jenkins@team.telstra.com>
From: Gregory McGarry <g.mcgarry@ieee.org>
List: port-i386
Date: 08/01/2001 08:18:08
Jenkins, Graham K [IBM GSA] wrote:

> > Guys, I've had a number of kernel crashes whilst trying to build
> > a new kernel on a 486 with 16Mb memory, with /tmp on a memory filesystem
> > in swap space.  Happens about 40 minutes after restarting the 'make'. 
> > Unmounting /tmp seemed to solve the problem.
>
> I've replicated the problem on another box.  A few more observations:
> 
> 1) Can also occur if /tmp resides on root partition (or anyplace else?)
> 2) Can also occur doing 'bzip2 -d' on large files
> 3) Still occurs even if more swap space assigned (e.g. 160Mb for 16Mb
>     machine); swap usage (as shown by 'swapctl -lk') never goes beyond 24Mb
> 4) Doesn't occur when using a custom kernel (GENERIC_TINY, with addition of
>     XSERVER, MFS, UCONSOLE and some PCI network drivers - and omission
>     of I686_CPU)
> 
> In cases 1 thru 3, the console message is:
> ---
> kernel: page fault trap, code=0
> Stopped at      sw1+0x15:       movl     %eax,     0x4(%edx)
> db> reboot
> ---
> So my impression is that it's something in the GENERIC kernel - or perhaps
> in I686-specific code.

It looks like the run queues are getting corrupted somewhere.
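
As a rough illustration of why that trap points at the run queues: the
faulting instruction stores %eax at offset 4 from %edx, which is the
shape of a back-pointer update when an entry is unlinked from a
circular doubly-linked queue.  The sketch below is not the kernel's
code.  The struct and function names are hypothetical, and it assumes
the forward and back links are the first two pointer-sized fields of
each entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two queue links of a run queue entry. */
struct qlink {
    struct qlink *forw;   /* offset 0: next entry */
    struct qlink *back;   /* offset 4 with 32-bit pointers: previous entry */
};

/* An empty circular queue points at itself in both directions. */
static void runq_init(struct qlink *head)
{
    head->forw = head;
    head->back = head;
}

/* Append p at the tail of the queue. */
static void runq_insert_tail(struct qlink *head, struct qlink *p)
{
    assert(head->back->forw == head);   /* ring must be intact before we touch it */
    p->forw = head;
    p->back = head->back;
    head->back->forw = p;
    head->back = p;
}

/* Unlink and return the first entry, or NULL if the queue is empty. */
static struct qlink *runq_remove_head(struct qlink *head)
{
    struct qlink *p = head->forw;
    struct qlink *next;

    if (p == head)
        return NULL;
    next = p->forw;
    head->forw = next;
    next->back = head;   /* a store of this shape faults if p->forw was garbage */
    p->forw = NULL;
    p->back = NULL;
    return p;
}

/* Walk at most limit entries and check that every forw/back pair agrees. */
static int runq_consistent(const struct qlink *head, int limit)
{
    const struct qlink *p = head;
    int i;

    for (i = 0; i <= limit; i++) {
        if (p->forw == NULL || p->forw->back != p)
            return 0;
        p = p->forw;
        if (p == head)
            return 1;
    }
    return 0;   /* did not get back to the head: ring is broken or too long */
}
```

If an entry's forward link has been overwritten, the store marked above
writes through a bad address, which would be consistent with a page
fault trap at that point in the switch code.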

I wasn't able to reproduce this problem on a DX4 with 32MB RAM
+ 64MB swap running -current.  Tried a custom kernel and a GENERIC
kernel.  I filled the mfs file system with data until swap was
exhausted and still didn't provoke this problem.

To go further, a stack trace and a list of the system
processes are needed.  A core dump would probably help too.  I'd
also suggest incrementally adding config options to your custom
kernel to hunt down the problem.  I'll try a 1.5.1 kernel.

What I did notice is that when my machine ran out of swap,
it locked up because pagedaemon and aiodoned were fighting
for memory.  A breakpoint in mi_switch() shows the processor
bouncing between these two processes.  Not good.

	-- Gregory McGarry <g.mcgarry@ieee.org>