Subject: Re: Fw: kern/28865: panic in in_cksum()
To: None <current-users@netbsd.org>
From: Paul Dokas <dokas@cs.umn.edu>
List: current-users
Date: 02/17/2005 22:34:26
On Mon, 24 Jan 2005 14:41:02 -0600, Paul Dokas <dokas@cs.umn.edu> wrote:
> Can anyone help shed a little light on this bug for me?  I'm getting tired of rebooting
> my machine every other morning  :-/
> 
> In particular, I'd find it useful to know what kind of information I should get after the
> machine panics.
> 
> Paul


Following up on my own follow-up.  (Yes, I do sometimes talk to myself.  Sometimes
even on public email lists, apparently  ;-)


I think that I've got an idea about what's going on.  I think that there's a
serious bug somewhere in the fxp driver, possibly related to a thread from
tech-kern in Feb 2003.  Here are two of the more relevant emails:

  http://mail-index.netbsd.org/tech-kern/2003/02/11/0012.html
  http://mail-index.netbsd.org/tech-kern/2003/02/13/0013.html

Why do I think that what I'm seeing is related?  Well, I upgraded another
machine to -current as of Feb 14 and it started locking up after only 5
to 10 minutes of passing a large amount of random traffic
(it's a firewall for a network full of laptops).  The lockups were
happening while using this NIC:

  fxp0 at pci2 dev 7 function 0: i82559 Ethernet, rev 8
  fxp0: interrupting at irq 3
  fxp0: Ethernet address 00:02:b3:8c:2f:0e
  inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
  inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

Whenever I broke into the kernel debugger, it was stuck in fxp_rxintr().  This
lockup repeated itself a few times before I swapped out the NIC for one of these:

  fxp0 at pci2 dev 8 function 0: Intel i82557 Ethernet, rev 2
  fxp0: interrupting at irq 10
  fxp0: Enabling receiver lock-up workaround
  fxp0: Ethernet address 00:a0:c9:81:11:d9
  inphy0 at fxp1 phy 1: i82555 10/100 media interface, rev. 0
  inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

This time the machine ran for a few hours and then locked up in the fxp driver
again, in a different place but still one that seems to deal with DMA.  Here's a
portion of the stack trace (copied by hand):

  Xspllower(7,c0fdff00,ffffffff,286,c0dea000) at netbsd:Xspllower+0xe
  m_freem(c0d3f500,0,52,c2507634,c0d3f500) at netbsd:m_freem+0x99
  fxp_start(c0dea044,c047aa9c,c0dea044,2,ca517024) at netbsd:fxp_start+0x2c4
  ether_output(c0dea044,c2506000,c0fe1d98,c0fc5df0,c2586000) at netbsd:ether_output+0x2dc
  ip_output(c2508000,0,c03fa1f4,1,8) at netbsd:ip_output+0x621
  ip_forward(c2506000,0,c0f7a000,1,0) at netbsd:ip_forward+0x16a
  ip_input(c2506000,0,0,246,0) at netbsd:ip_input+0x27b
  ipintr(928a0010,50030,cdba0010,c0470010,c0477000) at netbsd:ipintr+0x76
  DDB lost frame for netbsd:Xsoftnet+0x41, trying 0xc047ae80
  Xsoftnet() at netbsd:Xsoftnet+0x41


The final nail in the coffin for me is that I swapped out the Intel NICs for a 3Com:

  ex0 at pci2 dev 8 function 0: 3Com 3c905C-TX 10/100 Ethernet with mngmt (rev. 0x78)
  ex0: interrupting at irq 10
  ex0: MAC address 00:04:75:c7:b4:b7
  exphy0 at ex0 phy 24: 3Com internal media interface
  exphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

and haven't had any problems since.


Given that I'm seeing lockups in fxp_rxintr() and in the fxp driver in general
at places that seem to have to do with DMA, is it possible that the race condition
described in that first tech-kern email has surfaced for me?  Or am I just
reading too much into this?

I seriously hope that it's not the problem described in that thread because I've
got another machine that I'd _really_ like to upgrade to get to IPFilter 4.1.5.
But it's got wm NICs and according to this:

  http://mail-index.netbsd.org/tech-kern/2003/02/11/0018.html

it's likely to be affected by the same problem.



Paul
-- 
Paul Dokas                                            dokas@cs.umn.edu
======================================================================
Don Juan Matus:  "an enigma wrapped in mystery wrapped in a tortilla."