Re: netbsd-6 instability - vmem

To: Greg Troxel <gdt%ir.bbn.com@localhost>
Subject: Re: netbsd-6 instability - vmem
From: Dave B <spam%y2013.dberg.net@localhost>
Date: Tue, 12 Mar 2013 13:52:41 -0400

On Thu, Feb 07, 2013 at 01:15:52PM -0500, Greg Troxel wrote:
> ... my working hypothesis is
> that kernel virtual address space is being exhausted, and that this
> isn't handled well.  So anything that causes more kernel virtual space
...
> If you can avoid running X, and then provoke a lockup, ddb may be
> interesting.  I found processes in vmem and tstile, and 'show pool'
> indicated failure of the pool code to get memory.
>
> Also, without X, if there are any disk issues, you're more likely to see
> the logs.

  The suspicion that memory exhaustion caused my system's issues
seems to have been on target.  The short version: whenever the
system needed swap space, and my swap device was something more
complex than a plain disk partition--such as raidframe or cgd--a
lock-up occurred, requiring a power-cycle.

  I don't know what the actual, fundamental issue is, nor how to
fix it, but I learned some specifics, which follow.

  Exploring the problem with no X, on the macppc box, I narrowed
down the circumstances under which lock-ups occur by trying various
swap configurations.  Normally, my swap partition is a cgd device
(e.g., /dev/cgd1b) backed by a raidframe mirror.  To test, I loaded
the system with a few artificially heavy CPU/IO/memory-intensive
user processes, and was able to find that

 o  running with NO swap partition seems to restore system stabil-
    ity--UVM does of course unapologetically kill userland process-
    es as memory runs low... but the system doesn't lock or crash

 o  starting out with no swap partition and then ADDING a vanilla--
    non-raidframe--swap device (once "top" showed low memory), such
    as /dev/wd0g, also seems to make the system stable--and without
    the penalty of UVM's killing userland processes--but also with-
    out the benefit of raidframe's resiliency or cgd's security

 o  OTOH starting with no swap partition, and then similarly adding
    a raidframe-backed--but non-encrypted--device (again, once free
    memory runs low), such as /dev/raid1b, locks up the system

 o  likewise, starting out with swap already enabled on /dev/raid1b
    still results in a locked-up system, once free memory runs low
    and swapping or paging kicks in

 o  (interesting case, by contrast:)
   o  starting with no swap partition
   o  adding vanilla /dev/wd0g swap device
   o  adding a subsequent raidframe /dev/raid1b swap device, and,
   o  finally, removing /dev/wd0g as a swap device
    seems to leave the system stable (although I didn't try stop-
    ping and restarting all the resource-intensive user processes)

and last but not least,

 o  running with a cgd-encrypted--but NOT-raidframe-backed--swap
    device (/dev/cgd3c) also results in an assertion failure and
    system panic once memory runs low (output hand copied, below):

===================================================================
panic: kernel diagnostic assertion "!ISSET(bp->b_oflags, DO_DONE)" failed: file
"/f/nb/6.x/src/sys/kern/vfs_bio.c", line 1497
cpu0: Begin traceback...
0x1000bdf0: at kern_assert+0x68
0x1000be30: at biodone+0xd0
0x1000be40: at dkiodone+0x94
0x1000be60: at biodone2+0x84
0x1000be70: at cgdiodone+0xd0
0x1000be90: at biodone2+0x84
0x1000bea0: at biointr+0xc4
0x1000bec0: at softint_dispatch+0x158
0x1000bf20: at softint_fast_dispatch+0xdc
0x1000bfe8: at 0xff7dedd8
trap: kernel read DSI trap @ 0xef589cff by 0x1d6024 (DSISR 0x40000000, err=14),
lr 0x1d663c
Press a key to panic.
===================================================================

  So it seems I can have a stable system if I use a vanilla swap
partition, without cgd encryption and without raidframe; but I'd
rather have the advantages of those subsystems if at all possible.
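
  As a rough sketch only, the steps of the "interesting case" above
map onto swapctl(8) invocations along the following lines.  This
assumes swapctl(8) is the tool used to add and remove the devices,
and that the system starts with no swap configured.  The trailing
"swapctl -l" is an addition that just lists the result; it was not
part of the test as described.  The function name is made up for
illustration, and the script only prints commands, it runs nothing.

```python
# Dry-run sketch: prints swapctl(8) command lines for the
# "interesting case" sequence.  Nothing is executed.
# Assumption: the system starts with no swap device configured.

PLAIN_SWAP = "/dev/wd0g"    # vanilla disk partition (from the tests)
RAID_SWAP = "/dev/raid1b"   # raidframe-backed, not encrypted


def interesting_case_plan(plain=PLAIN_SWAP, raid=RAID_SWAP):
    """Return the command lines in the order the steps are listed."""
    return [
        ["swapctl", "-a", plain],  # add the vanilla swap device
        ["swapctl", "-a", raid],   # add the raidframe swap device
        ["swapctl", "-d", plain],  # remove the vanilla swap device
        ["swapctl", "-l"],         # (added) list configured swap
    ]


if __name__ == "__main__":
    for cmd in interesting_case_plan():
        print(" ".join(cmd))
```

  Printing rather than running keeps the sketch safe to try on any
machine; the order of the steps is the part that seemed to matter.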

...
> Certainly turn on DIAGNOSTIC.  Compared to DEBUG and LOCKDEBUG it
> doesn't hurt, and I run machines with DIAGNOSTIC all the time.
...

  Done.  The above tests were done with a fairly newly minted
-rnetbsd-6 DIAGNOSTIC-enabled kernel.

  If you've read through all this, I appreciate it--I know it was
long.  And thank you for any new insights on it that you can share.

-D
