Subject: kern/28865: panic in in_cksum()
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <dokas@cs.umn.edu>
List: netbsd-bugs
Date: 01/04/2005 19:24:00
>Number:         28865
>Category:       kern
>Synopsis:       panic in in_cksum()
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Jan 04 19:24:00 +0000 2005
>Originator:     Paul Dokas
>Release:        NetBSD 2.99.11
>Organization:
University of Computer Science, Dept of Computer Science
>Environment:
System: NetBSD host.cs.umn.edu 2.99.11 NetBSD 2.99.11 (HOST) #10: Thu Dec 23 10:13:41 CST 2004 root@host.cs.umn.edu:/usr/obj/sys/arch/i386/compile/HOST i386
Architecture: i386
Machine: i386
>Description:

  I've got a host acting as a syslog collector that crashes under load in in_cksum().
This started happening after I rebuilt the system on Dec 20, 2004.  I suspect that it's
caused by the checksumming changes that went in during the first half of Dec.
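
  For context, in_cksum() computes the 16-bit one's-complement Internet checksum
described in RFC 1071.  Below is a minimal Python sketch of that algorithm, for
illustration only.  It is not the NetBSD code, and the name internet_checksum is
made up for this example.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement Internet checksum (RFC 1071) over a byte string."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte

    total = 0
    for i in range(0, len(data), 2):
        # Sum the data as big-endian 16-bit words.
        total += (data[i] << 8) | data[i + 1]

    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)

    return ~total & 0xFFFF


# Sample bytes taken from the worked example in RFC 1071.
sample = bytes.fromhex("0001f203f4f5f6f7")
print(hex(internet_checksum(sample)))
```

  A buffer that already carries its own correct checksum sums to zero, which is how
a receiver verifies a packet.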

  Here's the panic information (copied by hand):

kernel: page fault trap, code=0
stopped in pid 10628.1 (logsurfer) at netbsd:in_cksum+0x9e   adcl   0x1c(%eb), %eax
db> bt
in_cksum(ca52f000,0,5dc,9000001,13092600) at netbsd:in_cksum+0x9e
?(c124b534,0,0,2,1c082500) at 0
?(c124c820,0,0,0,1c1b8800) at 0
?(c12cc620,48101180,0,0,11e15600) at 0
?(c130bd20,0,0,0,3c17d00) at 0
Bad frame pointer:  0c130b500
db> show reg
dx	0x10
es	0x10
fs	0x30
gs	0x10
edi	0x14
esi	0
ebp	0xc0fed600	pnpbios_softc+0xbcf37c
ebx	0xc0ffafe4	pnpbios_softc+0xbdcd60
edx	0xbc
ecx	0xc0ff6d00	pnpbios_softc+0xbd8a7c
eax	0x17626744
eip	0xc029899a	in_cksum+0x9e
cs	0x8
eflags	0x10217
esp	0xcb4bfb44	pnpbios_softc+0xb0518c0
ss	0x10
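
  A note on the faulting instruction: the operand register was copied by hand as
"%eb", so it is ambiguous.  The sketch below only does the address arithmetic for the
two candidates (%ebx and %ebp, both assumptions) using the values from "show reg".
It is not a diagnosis, and the helper name operand_address is made up for this
example.

```python
PAGE_SIZE = 0x1000  # 4 KiB pages on i386


def operand_address(base, displacement):
    """Effective address of disp(%reg), wrapped to 32 bits."""
    return (base + displacement) & 0xFFFFFFFF


# Register values copied from the "show reg" output above.
# Which register the instruction really used is an assumption.
candidates = {"ebx": 0xC0FFAFE4, "ebp": 0xC0FED600}

for name, value in candidates.items():
    addr = operand_address(value, 0x1C)
    on_boundary = addr % PAGE_SIZE == 0
    print(f"0x1c(%{name}) -> {addr:#010x}, page-aligned: {on_boundary}")
```

  If the register was %ebx, the operand sits at 0xc0ffb000, the first byte of a new
page, which would be consistent with a read running just past the end of a mapped
buffer.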


  Here's a little more background on this machine:

    + it's collecting syslogs from around 1,000 other computers
    + it's got a couple of IPSEC tunnels similar to this:

        spdadd 128.101.this.host/32 146.57.that.host/32 any -P out ipsec esp/transport//require ah/transport//require;
        spdadd 146.57.that.host/32 128.101.this.host/32 any -P in ipsec esp/transport//require ah/transport//require;

    + it's using IPFilter to implement a "no inbound except for syslog and ssh" policy
    + it's using IPNat to allow outbound RSH:

        map fxp0 0.0.0.0/0 -> 128.101.this.host/32 proxy port shell rcmd/tcp


  Also, these panics look to be load related.  They were happening pretty consistently
at 5:20am every morning until I moved the crontab that fired off at that time.  Now
the panics are happening at fairly random times, but they only seem to happen when the
machine's load goes above 2.0.

  And finally, when it does panic, the backtrace is always different, but always looks
like it's been corrupted somehow (to my uneducated eye that is).  If I had to guess, I'd
say that this looks like a stack corruption of some sort.

>How-To-Repeat:

  Build with -current and attempt to collect syslogs from a few thousand hosts.
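
  For anyone without a few thousand hosts handy, a rough way to approximate the
inbound syslog volume is a small UDP sender like the Python sketch below.  This is an
assumption about what might reproduce the load, not something I have confirmed
triggers the panic.  It sends from a single source, uses a simplified message format
with no timestamp, and does not go through the IPsec tunnels.  The names syslog_line
and send_burst and the address in the comment are hypothetical.

```python
import socket


def syslog_line(pri, host, seq):
    """Build a simplified BSD-syslog-style datagram (no timestamp)."""
    return f"<{pri}>{host} loadtest: message {seq}".encode("ascii")


def send_burst(target, hosts, per_host):
    """Send per_host messages for each of `hosts` fake host names over UDP.

    target is an (address, port) tuple.  Returns the number of datagrams sent.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(per_host):
            for h in range(hosts):
                # Priority 13 is facility user (1) * 8 + severity notice (5).
                sock.sendto(syslog_line(13, f"host{h}", seq), target)
                sent += 1
    return sent


# Example call (hypothetical collector address, standard syslog UDP port):
#   send_burst(("192.0.2.10", 514), hosts=2000, per_host=50)
```

  Running several copies from different machines would come closer to the real
traffic pattern than one sender.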

>Fix:

  Sorry, I don't know.