Subject: Re: Fw: kern/28865: panic in in_cksum()
To: None <current-users@netbsd.org>
From: Paul Dokas <dokas@cs.umn.edu>
List: current-users
Date: 02/17/2005 22:34:26
On Mon, 24 Jan 2005 14:41:02 -0600, Paul Dokas <dokas@cs.umn.edu> wrote:
> Can anyone help shed a little light on this bug for me? I'm getting tired of rebooting
> my machine every other morning :-/
>
> In particular, I'd find it useful to know what kind of information I should get after the
> machine panics.
>
> Paul
Following up on my own follow-up. (Yes, I do sometimes talk to myself. Sometimes
even on public email lists, apparently ;-)
I think that I've got an idea about what's going on. I think that there's a
serious bug somewhere in the fxp driver, possibly related to a thread from
tech-kern in Feb 2003. Here are two of the more relevant emails:
http://mail-index.netbsd.org/tech-kern/2003/02/11/0012.html
http://mail-index.netbsd.org/tech-kern/2003/02/13/0013.html
Why do I think that what I'm seeing is related? Well, I upgraded another
machine to -current as of Feb 14 and it started locking up after only 5
to 10 minutes of passing a large amount of random traffic
(it's a firewall for a network full of laptops). The lockups were
happening while using this NIC:
fxp0 at pci2 dev 7 function 0: i82559 Ethernet, rev 8
fxp0: interrupting at irq 3
fxp0: Ethernet address 00:02:b3:8c:2f:0e
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
Whenever I broke into the kernel debugger, it was stuck in fxp_rxintr(). This
lockup repeated itself a few times before I swapped out the NIC for one of these:
fxp0 at pci2 dev 8 function 0: Intel i82557 Ethernet, rev 2
fxp0: interrupting at irq 10
fxp0: Enabling receiver lock-up workaround
fxp0: Ethernet address 00:a0:c9:81:11:d9
inphy0 at fxp1 phy 1: i82555 10/100 media interface, rev. 0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
This time the machine ran for a few hours and then locked up in the fxp driver
again, though still in a place that seems to deal with DMA. Here's a portion
of the stack trace (copied by hand):
Xspllower(7,c0fdff00,ffffffff,286,c0dea000) at netbsd:Xspllower+0xe
m_freem(c0d3f500,0,52,c2507634,c0d3f500) at netbsd:m_freem+0x99
fxp_start(c0dea044,c047aa9c,c0dea044,2,ca517024) at netbsd:fxp_start+0x2c4
ether_output(c0dea044,c2506000,c0fe1d98,c0fc5df0,c2586000) at netbsd:ether_output+0x2dc
ip_output(c2508000,0,c03fa1f4,1,8) at netbsd:ip_output+0x621
ip_forward(c2506000,0,c0f7a000,1,0) at netbsd:ip_forward+0x16a
ip_input(c2506000,0,0,246,0) at netbsd:ip_input+0x27b
ipintr(928a0010,50030,cdba0010,c0470010,c0477000) at netbsd:ipintr+0x76
DDB lost frame for netbsd:Xsoftnet+0x41, trying 0xc047ae80
Xsoftnet() at netbsd:Xsoftnet+0x41
The final nail in the coffin for me is that I swapped out the Intel NICs for a 3Com:
ex0 at pci2 dev 8 function 0: 3Com 3c905C-TX 10/100 Ethernet with mngmt (rev. 0x78)
ex0: interrupting at irq 10
ex0: MAC address 00:04:75:c7:b4:b7
exphy0 at ex0 phy 24: 3Com internal media interface
exphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
and haven't had any problems since.
Given that I'm seeing lockups in fxp_rxintr() and in the fxp driver in general,
at places that seem to have to do with DMA, is it possible that the race
condition described in that first tech-kern email has surfaced for me? Or am
I just reading too much into this?
I seriously hope that it's not the problem described in that thread because I've
got another machine that I'd _really_ like to upgrade to get to IPFilter 4.1.5.
But it's got wm NICs and according to this:
http://mail-index.netbsd.org/tech-kern/2003/02/11/0018.html
it's likely to be affected by the same problem.
Paul
--
Paul Dokas dokas@cs.umn.edu
======================================================================
Don Juan Matus: "an enigma wrapped in mystery wrapped in a tortilla."