NetBSD-Bugs archive
Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
The following reply was made to PR port-sparc64/46260; it has been noted by
GNATS.
From: Julian Coleman <jdc%coris.org.uk@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: port-sparc64/46260: gem0 driver fails to recover after RX overflow
Date: Fri, 8 Jun 2012 10:48:56 +0100
Hi,
I've had a chance to look at this some more. I've been testing on a V120. A
summary would be that I've found this bug very hard to reproduce under normal
conditions. However, adding extra debugging output to the driver makes it a
lot easier. For example, adding a printf in gem_rint() makes it likely that
I'll hit the RX overflow several times when copying over a new kernel to test.
Note that the console is 9600 baud serial. I printed out the value of
sc->sc_rxptr at the end of the interrupt function, and also the values of
sc->sc_rxptr and the completion register when we overflow (I'd already verified
that the value of sc->sc_rxptr is equal to the completion register at the end
of the interrupt function). I see output like:
gem0: gem_rint end sc->sc_rxptr = 6
gem0: receive error: RX overflow sc->rxptr 6, complete 6
gem0: gem_rint end sc->sc_rxptr = 7
when the receiver doesn't lock up, and:
gem0: gem_rint end sc->sc_rxptr = 100
gem0: receive error: RX overflow sc->rxptr 100, complete 100
gem0: receiver stuck in overflow, resetting
gem0: gem_rint end sc->sc_rxptr = 1
when it does. It is possible that the chip has filled the whole ring when
it reports overflow, but I think that is fairly unlikely. However, I'm still
not sure why it locks up sometimes, and especially why it happens more with
NetBSD 5 or 6. I've also seen occasional:
gem0: rx_watchdog: not in overflow state: 0x810400
I think what sometimes happens here is that we get an RX_OVERFLOW that doesn't
lock up the receiver, and only a low number of packets have been received at
that point. So, we can end up resetting when we don't need to. However, I
can't see any difference between the overflows that lock up the receiver and
those that don't, so it seems best to reset here anyway.
> Yes. This is worrying. See the last paragraph of 2.6.1 "RxFIFO overflow"
> and also 2.3.2 "Frame Reception". An increase in overflows implies that the
> RX FIFO is not emptying fast enough, which implies that we are not reading
> and emptying packets from the ring buffer quickly enough when an interrupt
> occurs. Are you able to check earlier kernels (e.g. 5.0) to get a rough
> indication of when the increased resets problem started? I'm now unsure if
> this aspect is a gem(4) problem, or something else.
As I mentioned above, I don't think that we are filling up the ring buffer. I
had another look at the differences between the driver in netbsd-4-0 and in
netbsd-4. Apart from the difference between the settings of
GEM_MAC_CONTROL_MASK and GEM_INTMASK (we don't set GEM_INTR_PCS), I can't
see anything that could cause this. I've checked the current code with the previous
setting of GEM_MAC_CONTROL_MASK and with GEM_INTR_PCS interrupts enabled, and
I didn't see any difference (I also didn't see any GEM_INTR_PCS interrupts).
To try to make the hardware move packets out of the RX FIFO more quickly, I
lowered the threshold in the GEM_RX_CONFIG register to GEM_THRSH_64, but this
didn't seem to make much difference.
Looking at the history, most of the current changes came in after 4.0 was
released, and were pulled up to the netbsd-4 branch. Is it possible to
try a netbsd-4 kernel, so that we can try and work out if the problem is
with these changes, or with something that happened later, please?
Thanks,
J
--
My other computer also runs NetBSD / Sailing at Newbiggin
http://www.netbsd.org/ / http://www.newbigginsailingclub.org/