Subject: Re: Panic on a busy machine
To: John Klos <john@ziaspace.com>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 09/27/2005 18:25:55

hi,

this appears to be an interaction between the TCP code and the pool code.
we're in the middle of destroying a TCP connection when we take another
network interrupt.  the interrupt is for receiving a packet, and we try
to allocate another mbuf so that we are ready to receive another packet.
but when pool_get() is called to allocate one, the pool code sees that
the pool has already allocated the maximum it is allowed, so it tries to
reclaim some of that space, which ends up calling back into the TCP code
and tripping over some inconsistent state (though I don't know exactly
what this last bit is).
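
to make the call path concrete, here is a minimal userland sketch of
"allocation at the hard limit calls a drain hook".  every name in it
(toy_pool, toy_get, toy_put, toy_tcp_drain, toy_queued) is made up for
illustration; this is not the actual pool or TCP code, just a model of
the shape of the problem, assuming the drain hook runs synchronously
inside the allocation call.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical toy model, not the kernel's pool implementation */
struct toy_pool {
	int nout;                         /* items currently handed out */
	int hardlimit;                    /* most items allowed out at once */
	void (*drain)(struct toy_pool *); /* hook asked to give items back */
	int drain_calls;                  /* how often the hook has run */
};

/* items parked on an imaginary reassembly queue */
static int toy_queued = 0;

/* return one item to the pool */
static void
toy_put(struct toy_pool *pp)
{
	assert(pp->nout > 0);
	pp->nout--;
}

/* drain hook: free everything on the imaginary reassembly queue */
static void
toy_tcp_drain(struct toy_pool *pp)
{
	while (toy_queued > 0) {
		toy_queued--;
		toy_put(pp);
	}
}

/*
 * allocate one item.  at the hard limit the drain hook runs first, in
 * the caller's context.  if the caller is an interrupt handler that
 * preempted code in the middle of changing the same data, the hook
 * sees whatever half-updated state that code left behind.
 */
static int
toy_get(struct toy_pool *pp)
{
	if (pp->nout >= pp->hardlimit && pp->drain != NULL) {
		pp->drain_calls++;
		pp->drain(pp);
	}
	if (pp->nout >= pp->hardlimit)
		return -1;	/* still nothing free */
	pp->nout++;
	return 0;
}
```

in the traceback the role of toy_get() is played by pool_get() called
from ex_add_rxbuf(), and the hook is the m_reclaim() -> tcp_drain() ->
tcp_freeq() chain.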

can anyone see how the tcpcb is inconsistent during tcp_close() in such
a way that tcp_freeq() would have a problem with it?  if not, then one
way to prevent this particular crash would be to not call TCP_REASS_UNLOCK()
in tcp_close(), so that tcp_drain() would skip this tcpcb.
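
a rough sketch of that idea, again as a userland toy rather than kernel
code.  the names (toy_tcpcb, toy_reass_lock_try, toy_reass_unlock,
toy_close_begin, toy_drain_all) are hypothetical, and the sketch
assumes the drain side uses a try-lock and skips any control block it
cannot lock.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical stand-in for a TCP control block */
struct toy_tcpcb {
	bool reass_locked;	/* reassembly queue is busy */
	int reass_queued;	/* segments on the reassembly queue */
};

/* try to take the reassembly lock; never waits */
static bool
toy_reass_lock_try(struct toy_tcpcb *tp)
{
	if (tp->reass_locked)
		return false;
	tp->reass_locked = true;
	return true;
}

static void
toy_reass_unlock(struct toy_tcpcb *tp)
{
	tp->reass_locked = false;
}

/*
 * start tearing down a connection: take the lock and deliberately
 * never release it, since the control block is about to go away.
 */
static bool
toy_close_begin(struct toy_tcpcb *tp)
{
	return toy_reass_lock_try(tp);
}

/*
 * drain every connection we can lock and skip the busy ones.
 * returns the number of connections whose queue was emptied.
 */
static int
toy_drain_all(struct toy_tcpcb *tcbs, int n)
{
	int drained = 0;

	for (int i = 0; i < n; i++) {
		if (!toy_reass_lock_try(&tcbs[i]))
			continue;	/* busy or being closed: hands off */
		if (tcbs[i].reass_queued > 0) {
			tcbs[i].reass_queued = 0;
			drained++;
		}
		toy_reass_unlock(&tcbs[i]);
	}
	return drained;
}
```

with two control blocks, calling toy_close_begin() on the first and
then toy_drain_all() on both leaves the first one untouched and drains
only the second.  the cost is that a closing connection's queue is
never reclaimed by the drain path, so the close path has to free it
itself.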

-Chuck


On Mon, Jul 11, 2005 at 03:46:30PM -0700, John Klos wrote:
> Hi,
> 
> On a busy machine which is otherwise completely stable hardwarewise (with 
> uptimes over a year), I saw this panic when the system was pushing 40-50 
> Mbps:
> 
> (lots of these)
> WARNING: mclpool limit reached; increase NMBCLUSTERS
> WARNING: mclpool limit reached; increase NMBCLUSTERS
> WARNING: mclpool limit reached; increase NMBCLUSTERS
> WARNING: mclpool limit reached; increase NMBCLUSTERS
> trap: kernel write DSI trap @ 0x42d2c006 by 0x11c55c (DSISR 0x42000000, 
> err=14)
> panic: trap
> Begin traceback...
> 0x004448a0: at trap+0xec
> 0x00444920: kernel DSI write trap @ 0x42d2c006 by tcp_freeq+0x5c: 
> srr1=0x9032
>             r1=0x4449e0 cr=0x40009032 xer=0 ctr=0x11c5c0 dsisr=0x42000000
> 0x004449e0: at ADBDevTable+0xffc03a14
> 0x00444a00: at tcp_drain+0x88
> 0x00444a30: at m_reclaim+0xe0
> 0x00444a50: at pool_get+0x310
> 0x00444a80: at pool_cache_get_paddr+0xe4
> 0x00444aa0: at ex_add_rxbuf+0xe8
> 0x00444ae0: at ex_intr+0x248
> 0x00444b20: at ext_intr+0x228
> 0x00444b60: at trapstart+0x868
> 0x00444bc0: at callout_schedule+0x68
> 0x00444c10: at tcp_close+0x2b4
> 0x00444c40: at tcp_input+0xd10
> 0x00444e70: at ip_input+0x76c
> 0x00444ed0: at ipintr+0x80
> 0x00444f00: at softintr__run+0x8c
> 0x00444f20: at do_pending_int+0x23c
> 0x00444f50: at splx+0x40
> 0x00444f60: at ext_intr+0x1a4
> 0x00444fa0: at trapstart+0x868
> 0xffffe750: at ADBDevTable+0x4bb72970
> trap: pid 343.1 (setiathome): kernel MCHK trap @ 0x2e90dc (SRR1=0x49030)
> panic: trap
> Faulted in mid-traceback; aborting...dumpsys: TBD
> rebooting
> 
> That shouldn't happen, should it?
> 
> (NetBSD 2 from netbsd-2 branch, Performa 6400 class machine)
> 
> Any ideas?
> 
> John Klos