NetBSD-Bugs archive
Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
The following reply was made to PR port-sparc64/46260; it has been noted by
GNATS.
From: Julian Coleman <jdc%coris.org.uk@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
Date: Fri, 8 Jun 2012 10:48:56 +0100
Hi,
I've had a chance to look at this some more. I've been testing on a V120. A
summary would be that I've found this bug very hard to reproduce under normal
conditions. However, adding extra debugging output to the driver makes it a
lot easier. For example, adding a printf in gem_rint() makes it likely that
I'll hit the RX overflow several times when copying over a new kernel to test.
Note that the console is 9600 baud serial. I printed out the value of
sc->sc_rxptr at the end of the interrupt function, and also the values of
sc->sc_rxptr and the completion register when we overflow (I'd already verified
that the value of sc->sc_rxptr is equal to the completion register at the end
of the interrupt function). I see output like:
gem0: gem_rint end sc->sc_rxptr = 6
gem0: receive error: RX overflow sc->rxptr 6, complete 6
gem0: gem_rint end sc->sc_rxptr = 7
when the receiver doesn't lock up, and:
gem0: gem_rint end sc->sc_rxptr = 100
gem0: receive error: RX overflow sc->rxptr 100, complete 100
gem0: receiver stuck in overflow, resetting
gem0: gem_rint end sc->sc_rxptr = 1
when it does. It is possible that the chip has filled the whole ring when
it reports overflow, but I think that is fairly unlikely. However, I'm still
not sure why it locks up sometimes, and especially why it happens more with
NetBSD 5 or 6. I've also seen occasional:
gem0: rx_watchdog: not in overflow state: 0x810400
I think what sometimes happens here is that we get an RX_OVERFLOW that doesn't
lock up the receiver, and only a low number of packets have been received at
that point. So, we can end up resetting when we don't need to. However, I
can't see any difference between the overflows that lock up the receiver and
those that don't, so it seems best to reset here anyway.
> Yes. This is worrying. See the last paragraph of 2.6.1 "RxFIFO overflow"
> and also 2.3.2 "Frame Reception". An increase in overflows implies that the
> RX FIFO is not emptying fast enough, which implies that we are not reading
> and emptying packets from the ring buffer quickly enough when an interrupt
> occurs. Are you able to check earlier kernels (e.g. 5.0) to get a rough
> indication of when the increased resets problem started? I'm now unsure if
> this aspect is a gem(4) problem, or something else.
As I mentioned above, I don't think that we are filling up the ring buffer. I
had another look at the differences between the driver in netbsd-4-0 and in
netbsd-4. Apart from the difference between the settings of
GEM_MAC_CONTROL_MASK and GEM_INTMASK (we don't set GEM_INTR_PCS), I can't
see anything that could cause this. I've checked the current code with the previous
setting of GEM_MAC_CONTROL_MASK and with GEM_INTR_PCS interrupts enabled, and
I didn't see any difference (I also didn't see any GEM_INTR_PCS interrupts).
To try to make the hardware move packets out of the RX FIFO more quickly, I
lowered the threshold in the GEM_RX_CONFIG register to GEM_THRSH_64, but this
didn't seem to make much difference.
Looking at the history, most of the current changes came in after 4.0 was
released, and were pulled up to the netbsd-4 branch. Is it possible to
try a netbsd-4 kernel, so that we can try and work out if the problem is
with these changes, or with something that happened later, please?
Thanks,
J
--
My other computer also runs NetBSD / Sailing at Newbiggin
http://www.netbsd.org/ / http://www.newbigginsailingclub.org/