Current-Users archive


Re: Crash on -current in pool_drain()



On Sun, 18 Oct 2015, Paul Goyette wrote:

On Sun, 18 Oct 2015, Nick Hudson wrote:

On 10/18/15 00:30, Paul Goyette wrote:
Under heavy load, and after several hours of building packages, I am
seeing the following crash.  I'm doing a bisect to narrow it down, but
it has been happening for at least a week, with kernel and all
modules built from sources updated on 2015-10-13 at 08:30:00 UTC.

(This is on amd64)

Here's the backtrace from gdb:
[snip]

#8  0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
    at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
#9  0xffffffff802d1791 in uvm_pageout (arg=<optimized out>)
    at /build/netbsd-local/src/sys/uvm/uvm_pdaemon.c:343
#10 0xffffffff80100807 in lwp_trampoline ()
#11 0x0000000000000000 in ?? ()
(gdb) fr 8
#8  0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
    at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
1429                    if (drainpp == NULL) {
(gdb) disass pool_drain


I'm still working on a bisect - so far I have confirmed that the issue
occurs at least as far back as Oct 10, possibly earlier.

My "reproduction" involves building a large number of packages, one at
a time, with MAKE_JOBS=3.  At first I wasn't paying much attention, but
all of the crashes I specifically remember were on the 359th package,
www/firefox !


=> 0xffffffff80333415 <+59>:    mov (%rax),%rdx

I think %rax will be "weird" and indicate pool_head list corruption - no idea why, though.

%rax looks reasonable:

(gdb) info reg
rax            0xffffffff8099fb40       -2137392320
rbx            0x0      0
rcx            0xffffffff80724880       -2139993984
rdx            0x0      0
...

and matches the value reported for drainpp

(gdb) print drainpp
$1 = (struct pool *) 0xffffffff8099fb40

which also matches the tailq's tqh_last

$3 = {tqh_first = 0xffffffff80724880 <uvm_amap_cache>,
 tqh_last = 0xffffffff8099fb40}

However, it seems that something has been badly corrupted:

(gdb) print *drainpp
Cannot access memory at address 0xffffffff8099fb40
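
One possible reading of the values above, offered as a guess rather than
a diagnosis: in a sys/queue.h TAILQ, tqh_last holds the address of the
last element's tqe_next field. If the list entry is the first member of
the structure (assumed here, not checked against struct pool), tqh_last
is numerically equal to the address of the last element. On that
assumption drainpp would be pointing at the final pool on the list, and
it is that pool's memory which can no longer be read. The userland
sketch below uses a hypothetical struct fake_pool, not the kernel's
struct pool, only to show the address relationship.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/*
 * Hypothetical stand-in for a pool. The list entry is deliberately the
 * first member, which is the assumption this sketch depends on.
 */
struct fake_pool {
	TAILQ_ENTRY(fake_pool) fp_list;
	const char *fp_name;
};

TAILQ_HEAD(fake_pool_head, fake_pool);

/*
 * Build a two-element list and report whether tqh_last has the same
 * numeric value as the address of the last element.
 * Returns 1 if they match, 0 otherwise.
 */
int
demo_tail_address(void)
{
	struct fake_pool_head head;
	struct fake_pool a = { .fp_name = "a" };
	struct fake_pool b = { .fp_name = "b" };

	TAILQ_INIT(&head);
	TAILQ_INSERT_TAIL(&head, &a, fp_list);
	TAILQ_INSERT_TAIL(&head, &b, fp_list);

	/* tqh_last is really &b.fp_list.tqe_next, which sits at offset 0. */
	return (uintptr_t)head.tqh_last == (uintptr_t)&b &&
	    TAILQ_FIRST(&head) == &a;
}
```

If the entry were not the first member, the two values would differ by
the entry's offset within the structure, so the match seen in gdb only
identifies the last pool when that layout assumption holds.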

OK, I've narrowed this down even further...

It seems that a kernel built from '2015-10-10 04:30:00' works fine,
while one built three commits later, from '2015-10-10 06:00:00', fails
with the above backtrace.

Unfortunately, all three commits between those two time-stamps were
mine, so it would seem I broke something.   :)

The changes in question involve modifying the compat_netbsd32 module
to depend on the nfsserver and mqueue modules, and autoloading them
if they are not built into the kernel.

There is another, separately reported issue [1], with the mqueue
module being auto-unloaded (the panic is in pool_cache_destroy()), so
I'm fairly sure that it is the cause of this newest problem, too.
Unfortunately, while manually loading the mqueue module avoids the
pool_cache_destroy() panic, it does not avoid this one in
pool_drain().

I'm planning to start an inspection of the mqueue code to see if I
can figure out where it is destroying things.  It's not an area
with which I am very familiar, so any other eyeballs focused on the
code would be appreciated.
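
As an illustration of the class of bug being hunted here, and not a
claim about what the mqueue code actually does: if a module's memory
goes away while one of its pools is still linked on a global list, any
round-robin cursor into that list (the role drainpp plays in
pool_drain()) is left dangling. The sketch below is a userland model
with hypothetical names (fake_pool, fake_cursor, fake_pool_unregister);
it shows the teardown order that keeps such a cursor valid, namely
unlinking the element and resetting the cursor before the memory is
released.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a pool; not the kernel's struct pool. */
struct fake_pool {
	TAILQ_ENTRY(fake_pool) fp_list;
	int fp_drained;			/* times this pool was visited */
};

TAILQ_HEAD(fake_pool_head, fake_pool);

static struct fake_pool_head fake_head = TAILQ_HEAD_INITIALIZER(fake_head);
static struct fake_pool *fake_cursor;	/* plays the role of drainpp */

/* Link a pool onto the global list. */
void
fake_pool_register(struct fake_pool *pp)
{
	pp->fp_drained = 0;
	TAILQ_INSERT_TAIL(&fake_head, pp, fp_list);
}

/*
 * Unlink a pool. This must happen before its memory is released, and
 * the cursor must be reset if it still refers to the departing pool.
 */
void
fake_pool_unregister(struct fake_pool *pp)
{
	TAILQ_REMOVE(&fake_head, pp, fp_list);
	if (fake_cursor == pp)
		fake_cursor = NULL;
}

/* Visit one pool per call, round-robin; returns NULL if the list is empty. */
struct fake_pool *
fake_pool_drain_one(void)
{
	struct fake_pool *pp;

	if (fake_cursor == NULL)
		fake_cursor = TAILQ_FIRST(&fake_head);
	pp = fake_cursor;
	if (pp != NULL) {
		/* Advance first, so the cursor never lags behind pp. */
		fake_cursor = TAILQ_NEXT(pp, fp_list);
		pp->fp_drained++;
	}
	return pp;
}
```

Skipping fake_pool_unregister() in this model and then releasing the
element would leave both the list tail and the cursor pointing at freed
memory, which is the kind of state the gdb output above would be
consistent with.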


[1] http://mail-index.netbsd.org/current-users/2015/10/16/msg028198.html

+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+

