NetBSD-Bugs archive


Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow



The following reply was made to PR port-sparc64/46260; it has been noted by 
GNATS.

From: Julian Coleman <jdc%coris.org.uk@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
Date: Fri, 8 Jun 2012 10:48:56 +0100

 Hi,
 
 I've had a chance to look at this some more.  I've been testing on a V120.  In
 summary, I've found this bug very hard to reproduce under normal conditions.
 However, adding extra debugging output to the driver makes it a lot easier.
 For example, adding a printf in gem_rint() makes it likely that I'll hit the
 RX overflow several times when copying over a new kernel to test.  Note that
 the console is a 9600 baud serial line.  I printed out the value of sc->rxptr
 at the end of the interrupt function, and also the values of sc->rxptr and the
 completion register when we overflow (I'd already verified that the value of
 sc->rxptr is equal to the completion register at the end of the interrupt
 function).  I see output like:
 
   gem0: gem_rint end sc->sc_rxptr = 6
   gem0: receive error: RX overflow sc->rxptr 6, complete 6
   gem0: gem_rint end sc->sc_rxptr = 7
 
 when the receiver doesn't lock up, and:
 
   gem0: gem_rint end sc->sc_rxptr = 100
   gem0: receive error: RX overflow sc->rxptr 100, complete 100
   gem0: receiver stuck in overflow, resetting
   gem0: gem_rint end sc->sc_rxptr = 1
 
 when it does.  It is possible that the chip has filled the whole ring when
 it reports overflow, but I think that is fairly unlikely.  However, I'm still
 not sure why it sometimes locks up, and especially why it does so more with 5
 or 6.  I've also seen occasional:
 
   gem0: rx_watchdog: not in overflow state: 0x810400
 
 I think what sometimes happens here is that we get an RX_OVERFLOW that doesn't
 lock up the receiver, and that only a small number of packets have been
 received at this point.  So, we can end up resetting when we don't need to.
 However, I can't see any difference between the overflows that lock up the
 receiver and those that don't, so it seems safest to reset here anyway.
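 To illustrate what the debug output above is checking, here is a minimal
 standalone sketch of how sc->sc_rxptr chases the RX completion register
 modulo the ring size.  The ring size (NRXDESC) and the function name
 rint_consume are hypothetical, stand-ins for the driver's GEM_NRXDESC and
 the descriptor loop inside gem_rint(); this is not the driver code itself:
 
 ```c
 #include <stdio.h>
 
 /* Hypothetical ring size; the real value is the driver's GEM_NRXDESC. */
 #define NRXDESC 128
 
 /*
  * Consume descriptors from rxptr up to the completion index, returning
  * the new rxptr.  This mirrors, in outline, what gem_rint() has done by
  * the time the "gem_rint end sc->sc_rxptr = N" printf fires.
  */
 static unsigned
 rint_consume(unsigned rxptr, unsigned completion)
 {
 	while (rxptr != completion) {
 		/* process descriptor rxptr here ... */
 		rxptr = (rxptr + 1) % NRXDESC;
 	}
 	return rxptr;
 }
 
 int
 main(void)
 {
 	/*
 	 * After processing, rxptr equals the completion register, which
 	 * is exactly the invariant the debug printfs confirm.  Starting
 	 * at 120 and completing at 6 exercises the wrap past NRXDESC.
 	 */
 	unsigned rxptr = rint_consume(120, 6);
 	printf("gem0: gem_rint end sc->sc_rxptr = %u\n", rxptr);
 	return 0;
 }
 ```
 
 The point of logging both values at overflow time is that if they ever
 differed, we would know the ring still held unprocessed descriptors.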
 
 >  Yes.  This is worrying.  See the last paragraph of 2.6.1 "RxFIFO overflow"
 >  and also 2.3.2 "Frame Reception".  An increase in overflows implies that the
 >  RX FIFO is not emptying fast enough, which implies that we are not reading
 >  and emptying packets from the ring buffer quickly enough when an interrupt
 >  occurs.  Are you able to check earlier kernels (e.g. 5.0) to get a rough
 >  indication of when the increased resets problem started?  I'm now unsure if
 >  this aspect is a gem(4) problem, or something else.
 
 As I mentioned above, I don't think that we are filling up the ring buffer.  I
 had another look at the differences between the driver in netbsd-4-0 and in
 netbsd-4.  Apart from the difference between the settings of
 GEM_MAC_CONTROL_MASK and GEM_INTMASK (we don't set GEM_INTR_PCS), I can't
 see anything to cause this.  I've checked the current code with the previous
 setting of GEM_MAC_CONTROL_MASK and with GEM_INTR_PCS interrupts enabled, and
 I didn't see any difference (I also didn't see any GEM_INTR_PCS interrupts).
 
 To try to make the hardware move packets out of the RX FIFO more quickly, I
 lowered the threshold in the GEM_RX_CONFIG register to GEM_THRSH_64, but this
 doesn't seem to make much difference.
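 For anyone following along, the threshold change is a read-modify-write of
 one field in the RX configuration register.  The sketch below shows the
 pattern with hypothetical mask, shift, and threshold values; the real
 constants (GEM_RX_CONFIG and friends) are in sys/dev/ic/gemreg.h, and in
 the driver the register access would go through bus_space, not plain
 variables:
 
 ```c
 #include <stdio.h>
 #include <stdint.h>
 
 /* Hypothetical field layout, for illustration only. */
 #define RX_CONFIG_FIFO_THRS_MASK	0x07000000
 #define RX_CONFIG_FIFO_THRS_SHIFT	24
 #define THRSH_64	0	/* start draining at 64 bytes in the FIFO */
 #define THRSH_128	1
 #define THRSH_256	2
 
 /* Clear the old threshold field and set a new one. */
 static uint32_t
 set_rx_threshold(uint32_t rx_config, uint32_t thresh)
 {
 	rx_config &= ~RX_CONFIG_FIFO_THRS_MASK;
 	rx_config |= thresh << RX_CONFIG_FIFO_THRS_SHIFT;
 	return rx_config;
 }
 
 int
 main(void)
 {
 	/* Lower a config word that had the 256-byte threshold to 64. */
 	uint32_t v = set_rx_threshold(THRSH_256 << RX_CONFIG_FIFO_THRS_SHIFT,
 	    THRSH_64);
 	printf("rx_config = 0x%08x\n", v);
 	return 0;
 }
 ```
 
 A lower threshold makes the chip start DMAing sooner, which is why it was
 worth trying here, even though it made little difference in practice.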
 
 Looking at the history, most of the current changes came in after 4.0 was
 released, and were pulled up to the netbsd-4 branch.  Is it possible to
 try a netbsd-4 kernel, so that we can try and work out if the problem is
 with these changes, or with something that happened later, please?
 
 Thanks,
 
 J
 
 -- 
   My other computer also runs NetBSD    /        Sailing at Newbiggin
         http://www.netbsd.org/        /   http://www.newbigginsailingclub.org/
 

