Re: netbsd-6 instability - vmem
On Thu, Feb 07, 2013 at 01:15:52PM -0500, Greg Troxel wrote:
> ... my working hypothesis is
> that kernel virtual address space is being exhausted, and that this
> isn't handled well. So anything that causes more kernel virtual space
> If you can avoid running X, and then provoke a lockup, ddb may be
> interesting. I found processes in vmem and tstile, and 'show pool'
> indicated failure of the pool code to get memory.
> Also, without X, if there are any disk issues, you're more likely to see
> the logs.
The suspicion that memory exhaustion caused my system's problems
seems to have been on target. The short version: whenever the
system needed swap space, and my swap device was something more
complex than a plain disk partition--such as raidframe or cgd--
a lock-up occurred, requiring a power-cycle.
I don't know what the actual, fundamental issue is, nor how to
fix it, but I learned some specifics, which follow.
Exploring the problem with no X, on the macppc box, I narrowed
down the circumstances under which lock-ups occur by trying various
swap configurations. Normally, my swap partition is a cgd device
(e.g., /dev/cgd1b) backed by a raidframe mirror. To test, I loaded
the system with a few artificially heavy CPU/IO/memory-intensive
user processes, and was able to find that
o running with NO swap partition seems to restore system
stability--UVM does of course unapologetically kill userland
processes as memory runs low... but the system doesn't lock or crash
o starting out with no swap partition and then ADDING a vanilla--
non-raidframe--swap device (once "top" showed low memory), such
as /dev/wd0g, also seems to make the system stable--and without
the penalty of UVM's killing userland processes--but also
without the benefit of raidframe's resiliency or cgd's security
o OTOH starting with no swap partition, and then similarly adding
a raidframe-backed--but non-encrypted--device (again, once free
memory runs low), such as /dev/raid1b, locks up the system
o likewise, starting out with swap already enabled on /dev/raid1b
still results in a locked-up system, once free memory runs low
and swapping or paging kicks in
o (interesting case, by contrast:)
o starting with no swap partition
o adding vanilla /dev/wd0g swap device
o adding a subsequent raidframe /dev/raid1b swap device, and,
o finally, removing /dev/wd0g as a swap device
seems to leave the system stable (although I didn't try
stopping and restarting all the resource-intensive user processes)
and last but not least,
o running with a cgd encrypted--but NOT-raidframe-backed--swap
device (/dev/cgd3c) also results in an assertion failure and
system panic once memory runs low (output hand copied, below):
panic: kernel diagnostic assertion "!ISSET(bp->b_oflags, DO_DONE)" failed: file
"/f/nb/6.x/src/sys/kern/vfs_bio.c", line 1497
cpu0: Begin traceback...
0x1000bdf0: at kern_assert+0x68
0x1000be30: at biodone+0xd0
0x1000be40: at dkiodone+0x94
0x1000be60: at biodone2+0x84
0x1000be70: at cgdiodone+0xd0
0x1000be90: at biodone2+0x84
0x1000bea0: at biointr+0xc4
0x1000bec0: at softint_dispatch+0x158
0x1000bf20: at softint_fast_dispatch+0xdc
0x1000bfe8: at 0xff7dedd8
trap: kernel read DSI trap @ 0xef589cff by 0x1d6024 (DSISR 0x40000000, err=14),
Press a key to panic.
So it seems I can have a stable system if I use a vanilla swap
partition, without cgd encryption and without raidframe; but I'd
rather have the advantages of those subsystems if at all possible.
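For anyone wanting to retrace the "interesting case" above, the
sequence can be sketched as a small script around swapctl(8). This
is only an illustration: the device names are the ones from my
setup, and the run and swap_sequence helpers and the RUN variable
are names made up for this sketch. By default it only prints the
commands; setting RUN=1 (as root) would actually execute them.

```shell
#!/bin/sh
# Sketch only: helper names and RUN are hypothetical, devices are
# the ones mentioned in this message. Dry-run unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

swap_sequence() {
    run swapctl -a /dev/wd0g     # add the plain disk partition as swap
    run swapctl -a /dev/raid1b   # then add the raidframe-backed device
    run swapctl -d /dev/wd0g     # finally remove the plain partition
    run swapctl -l               # list what is configured afterwards
}

swap_sequence
```

Run as-is it just echoes the four swapctl invocations, which makes
it easy to check the order before trying it on a loaded machine.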
> Certainly turn on DIAGNOSTIC. Compared to DEBUG and LOCKDEBUG it
> doesn't hurt, and I run machines with DIAGNOSTIC all the time.
Done. The above tests were run with a fairly newly minted
-rnetbsd-6 DIAGNOSTIC-enabled kernel.
If you've read through all this, I appreciate it--I know it was
long. And thank you for any new insights on it that you can share.