Current-Users archive
Re: Crash on -current in pool_drain()
On Sun, 18 Oct 2015, Paul Goyette wrote:
On Sun, 18 Oct 2015, Nick Hudson wrote:
On 10/18/15 00:30, Paul Goyette wrote:
Under heavy load, and after several hours of building packages, I am
seeing the following crash. I'm doing a bisect to narrow it down further,
but it has been happening since at least a week ago, with the kernel and
all modules built from sources updated on 2015-10-13 at 08:30:00 UTC.
(This is on amd64)
Here's the backtrace from gdb:
[snip]
#8 0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
#9 0xffffffff802d1791 in uvm_pageout (arg=<optimized out>)
at /build/netbsd-local/src/sys/uvm/uvm_pdaemon.c:343
#10 0xffffffff80100807 in lwp_trampoline ()
#11 0x0000000000000000 in ?? ()
(gdb) fr 8
#8 0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
1429 if (drainpp == NULL) {
(gdb) disass pool_drain
I'm still working on a bisect - so far I have confirmed that the issue
occurs at least as far back as Oct 10, and possibly earlier.
My "reproduction" involves building a large number of packages, one at
a time, with MAKE_JOBS=3. At first I wasn't paying much attention, but
all of the crashes I specifically remember were on the 359th package,
www/firefox!
=> 0xffffffff80333415 <+59>: mov (%rax),%rdx
I think %rax will be "weird" and indicate pool_head list corruption - no
idea why, though.
%rax looks reasonable:
(gdb) info reg
rax 0xffffffff8099fb40 -2137392320
rbx 0x0 0
rcx 0xffffffff80724880 -2139993984
rdx 0x0 0
...
and matches the value reported for drainpp
(gdb) print drainpp
$1 = (struct pool *) 0xffffffff8099fb40
and which also matches the tailq's tqh_last
$3 = {tqh_first = 0xffffffff80724880 <uvm_amap_cache>,
tqh_last = 0xffffffff8099fb40}
However, it seems that something has been badly corrupted:
(gdb) print *drainpp
Cannot access memory at address 0xffffffff8099fb40
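As an aid to reading the values above: in the BSD TAILQ macros from
<sys/queue.h>, tqh_last holds the address of the last element's tqe_next
field, not a pointer to an element. The minimal sketch below assumes the
link entry is the first member of the element structure (demo_pool and
its helpers are hypothetical names, not kernel code). Under that
assumption the two addresses coincide, so a cursor that compares equal to
tqh_last would simply be pointing at the last entry on the list.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a pool; the link entry is deliberately first. */
struct demo_pool {
	TAILQ_ENTRY(demo_pool) link;	/* &p->link.tqe_next == (void *)p */
	const char *name;
};
TAILQ_HEAD(demo_head, demo_pool);

/* Nonzero if p has the same numeric address as the head's tqh_last. */
static int
tail_matches(const struct demo_head *h, const struct demo_pool *p)
{
	assert(h != NULL && p != NULL);
	return (const void *)h->tqh_last == (const void *)p;
}

/*
 * Builds a two-element list and checks that tqh_last aliases the last
 * element (b) and not the first (a). Returns 1 when both checks hold.
 */
static int
demo_tail_alias(void)
{
	struct demo_head head;
	struct demo_pool a = { .name = "a" }, b = { .name = "b" };

	TAILQ_INIT(&head);
	TAILQ_INSERT_TAIL(&head, &a, link);
	TAILQ_INSERT_TAIL(&head, &b, link);
	return tail_matches(&head, &b) && !tail_matches(&head, &a);
}
```

If the link entry were not the first member, tqh_last would instead point
somewhere inside the last element, and the comparison would fail.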
OK, I've narrowed this down even further...
It seems that a kernel built from '2015-10-10 04:30:00' works fine,
while one built three commits later, from '2015-10-10 06:00:00', fails
with the above backtrace.
Unfortunately, all three commits between those two time-stamps were
mine, so it would seem I broke something. :)
The changes in question involve modifying the compat_netbsd32 module
to depend on the nfsserver and mqueue modules, and autoloading them
if they are not built into the kernel.
There is another, separately reported issue [1], in which the mqueue
module is auto-unloaded (the panic is in pool_cache_destroy()), so I
suspect that it is also the cause of this newest problem. Unfortunately,
while manually loading the mqueue module avoids the
pool_cache_destroy() panic, it does not avoid this one in
pool_drain().
I'm planning to start an inspection of the mqueue code to see if I
can figure out where it is destroying things. It's not an area
with which I am very familiar, so any other eyeballs focused on the
code would be appreciated.
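For anyone joining the inspection, here is a minimal sketch of the kind
of pattern worth checking. The "if (drainpp == NULL)" test in the
backtrace suggests that drainpp is a cursor kept between calls. If so,
any code that removes an entry from the list must also move the cursor
off that entry, or the next pass dereferences freed or unmapped memory.
Everything below (rr_pool, rr_register, rr_unregister, rr_next) is
hypothetical and is not the code in subr_pool.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element; not the kernel's struct pool. */
struct rr_pool {
	TAILQ_ENTRY(rr_pool) link;
	int id;
};
TAILQ_HEAD(rr_head, rr_pool);

static struct rr_head rr_list = TAILQ_HEAD_INITIALIZER(rr_list);
static struct rr_pool *rr_cursor;	/* persists across rr_next() calls */

static void
rr_register(struct rr_pool *p)
{
	assert(p != NULL);
	TAILQ_INSERT_TAIL(&rr_list, p, link);
}

/* Remove p, first moving the cursor off it so it never dangles. */
static void
rr_unregister(struct rr_pool *p)
{
	assert(p != NULL);
	if (rr_cursor == p)
		rr_cursor = TAILQ_NEXT(p, link);
	TAILQ_REMOVE(&rr_list, p, link);
}

/* Return the next element in round-robin order, or NULL if the list is empty. */
static struct rr_pool *
rr_next(void)
{
	struct rr_pool *p;

	if (rr_cursor == NULL)
		rr_cursor = TAILQ_FIRST(&rr_list);
	p = rr_cursor;
	if (p != NULL)
		rr_cursor = TAILQ_NEXT(p, link);
	return p;
}
```

Dropping the cursor check in rr_unregister() leaves rr_cursor pointing at
the removed element, which would fit a cursor whose target can no longer
be read, though that is only one possible explanation.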
[1] http://mail-index.netbsd.org/current-users/2015/10/16/msg028198.html
+------------------+--------------------------+-------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+-------------------------+