Current-Users archive


Re: Crash on -current in pool_drain()



On Sun, 18 Oct 2015, Nick Hudson wrote:

On 10/18/15 00:30, Paul Goyette wrote:
Under heavy load, and after several hours of building packages, I am
seeing the following crash.  I'm doing a bisect to narrow it down further,
but it has been happening for at least a week, with the kernel and all
modules built from sources updated on 2015-10-13 at 08:30:00 UTC.

(This is on amd64)

Here's the backtrace from gdb:
[snip]

#8  0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
    at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
#9  0xffffffff802d1791 in uvm_pageout (arg=<optimized out>)
    at /build/netbsd-local/src/sys/uvm/uvm_pdaemon.c:343
#10 0xffffffff80100807 in lwp_trampoline ()
#11 0x0000000000000000 in ?? ()
(gdb) fr 8
#8  0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
    at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
1429                    if (drainpp == NULL) {
(gdb) disass pool_drain


This looks like one of the crashes riz@ had on a tegra which I think was also
building packages.

I'm still working on a bisect.  So far I have confirmed that the issue
occurs at least as far back as Oct 10, and possibly earlier.

My "reproduction" involves building a large number of packages, one at
a time, with MAKE_JOBS=3.  At first I wasn't paying much attention, but
all of the crashes I specifically remember were on the 359th package,
www/firefox!


=> 0xffffffff80333415 <+59>:    mov (%rax),%rdx

I think %rax will be "weird" and indicate pool_head list corruption - no idea why, though.

%rax looks reasonable:

(gdb) info reg
rax            0xffffffff8099fb40       -2137392320
rbx            0x0      0
rcx            0xffffffff80724880       -2139993984
rdx            0x0      0
...

and matches the value reported for drainpp

(gdb) print drainpp
$1 = (struct pool *) 0xffffffff8099fb40

which also matches the tailq's tqh_last

$3 = {tqh_first = 0xffffffff80724880 <uvm_amap_cache>,
  tqh_last = 0xffffffff8099fb40}

However, it seems that something has been badly corrupted:

(gdb) print *drainpp
Cannot access memory at address 0xffffffff8099fb40
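
For anyone following along, here is a small userland sketch of why a
struct pointer can compare equal to a tailq's tqh_last.  In <sys/queue.h>,
tqh_last holds the address of the last element's tqe_next field (or of
tqh_first when the list is empty).  If the list linkage happens to be the
first member of the element, that address is numerically the same as the
address of the last element itself.  The struct and function names below
are hypothetical stand-ins, and the sketch deliberately places the linkage
first; whether that matches the kernel's struct pool should be checked
against the source.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/*
 * Hypothetical stand-in for a list element.  The linkage is placed
 * first on purpose (an assumption of this sketch), so that the address
 * of link.tqe_next equals the address of the element itself.
 */
struct item {
	TAILQ_ENTRY(item) link;
	int id;
};

TAILQ_HEAD(item_list, item);

/*
 * Returns 1 if tqh_last holds the address of the last element's
 * tqe_next field, or of tqh_first when the list is empty.
 */
static int
tail_invariant_holds(struct item_list *head)
{
	struct item *it, *last = NULL;

	/* Walk the list to find the last element the slow way. */
	TAILQ_FOREACH(it, head, link)
		last = it;

	if (last == NULL)
		return head->tqh_last == &head->tqh_first;
	return head->tqh_last == &last->link.tqe_next;
}

/* Build a two-element list and check the invariant at each step. */
static int
demo(void)
{
	struct item_list head = TAILQ_HEAD_INITIALIZER(head);
	struct item a = { .id = 1 }, b = { .id = 2 };
	int ok;

	ok = tail_invariant_holds(&head);
	TAILQ_INSERT_TAIL(&head, &a, link);
	ok = ok && tail_invariant_holds(&head);
	TAILQ_INSERT_TAIL(&head, &b, link);
	ok = ok && tail_invariant_holds(&head);

	/* With the linkage first, tqh_last also equals the last element. */
	ok = ok && (void *)head.tqh_last == (void *)&b;
	return ok;
}
```

Calling demo() from a test main() should return 1.  Under the same
layout assumption, a cursor whose value equals tqh_last would simply be
pointing at the final element of the list, so the matching values above
do not by themselves show that the list head is damaged.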




+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+

