Re: Crash on -current in pool_drain()

To: Nick Hudson <skrll%netbsd.org@localhost>
Subject: Re: Crash on -current in pool_drain()
From: Paul Goyette <paul%vps1.whooppee.com@localhost>
Date: Sun, 18 Oct 2015 16:19:28 +0800 (PHT)

On Sun, 18 Oct 2015, Nick Hudson wrote:

On 10/18/15 00:30, Paul Goyette wrote:

Under heavy load, and after several hours of building packages, I am
seeing the following crash.  I'm doing a bisect to narrow it down further,
but it has been happening for at least a week, with kernel and all
modules built from sources updated on 2015-10-13 at 08:30:00 UTC.

(This is on amd64)

Here's the backtrace from gdb:

[snip]

#8  0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
    at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
#9  0xffffffff802d1791 in uvm_pageout (arg=<optimized out>)
    at /build/netbsd-local/src/sys/uvm/uvm_pdaemon.c:343
#10 0xffffffff80100807 in lwp_trampoline ()
#11 0x0000000000000000 in ?? ()
(gdb) fr 8
#8  0xffffffff80333415 in pool_drain (ppp=ppp@entry=0xfffffe810f528e30)
    at /build/netbsd-local/src/sys/kern/subr_pool.c:1429
1429                    if (drainpp == NULL) {
(gdb) disass pool_drain



This looks like one of the crashes riz@ had on a tegra which I think was also
building packages.


I'm still working on a bisect - so far I have confirmed that the issue
occurs at least as far back as Oct 10, possibly earlier.

My "reproduction" involves building a large number of packages, one at
a time, with MAKE_JOBS=3.  At first I wasn't paying much attention, but
all of the crashes I specifically remember were on the 359th package,
www/firefox !

=> 0xffffffff80333415 <+59>:    mov (%rax),%rdx

I think %rax will be "weird" and indicate pool_head list corruption - no
idea why, though.


%rax looks reasonable:

(gdb) info reg
rax            0xffffffff8099fb40       -2137392320
rbx            0x0      0
rcx            0xffffffff80724880       -2139993984
rdx            0x0      0
...

and matches the value reported for drainpp

(gdb) print drainpp
$1 = (struct pool *) 0xffffffff8099fb40

which also matches the tailq's tqh_last

$3 = {tqh_first = 0xffffffff80724880 <uvm_amap_cache>,
  tqh_last = 0xffffffff8099fb40}

However, it seems that something has been badly corrupted:

(gdb) print *drainpp
Cannot access memory at address 0xffffffff8099fb40
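
For reference when reading the tqh_last value above: in the BSD
<sys/queue.h> tail queue, tqh_last is not a pointer to an element.  It
holds the address of the last element's tqe_next field, or the address of
tqh_first when the queue is empty.  The following is only a minimal
user-space sketch of that invariant, not the kernel code; struct item, the
link field, tail_slot_ok() and tailq_demo() are hypothetical names used
for illustration, with struct item standing in for struct pool.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical element type; stands in for struct pool in this sketch. */
struct item {
	int id;
	TAILQ_ENTRY(item) link;
};

TAILQ_HEAD(item_head, item);

/*
 * Returns 1 if tqh_last points at the slot that terminates the list:
 * &last->link.tqe_next for a non-empty queue, &head->tqh_first otherwise.
 */
static int
tail_slot_ok(struct item_head *h)
{
	struct item *it, *last = NULL;

	TAILQ_FOREACH(it, h, link)
		last = it;
	if (last == NULL)
		return h->tqh_last == &h->tqh_first;
	return h->tqh_last == &last->link.tqe_next;
}

/* Builds a two-element queue and checks the invariant at each step. */
static int
tailq_demo(void)
{
	struct item_head head;
	struct item a = { .id = 1 }, b = { .id = 2 };

	TAILQ_INIT(&head);
	assert(tail_slot_ok(&head));

	TAILQ_INSERT_TAIL(&head, &a, link);
	TAILQ_INSERT_TAIL(&head, &b, link);

	/* tqh_last is the address of b's next pointer, not of b itself. */
	assert(head.tqh_last == &b.link.tqe_next);
	/* The slot it points at holds NULL, which terminates the list. */
	assert(*head.tqh_last == NULL);

	return tail_slot_ok(&head);
}
```

In an intact queue, then, an element pointer would not normally be
expected to equal tqh_last unless the link field sits at offset zero in
the element.  Whether that has any bearing on the drainpp value seen here
is not established by the output above.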




+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+
