Subject: Re: kern/31455: ex (905[BC]) cards can hang in -current kernels
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Robert Elz <kre@munnari.OZ.AU>
List: netbsd-bugs
Date: 10/04/2005 16:15:03
The following reply was made to PR kern/31455; it has been noted by GNATS.

From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/31455: ex (905[BC]) cards can hang in -current kernels 
Date: Tue, 04 Oct 2005 23:05:14 +0700

 Christos Zoulas suggested a patch something like the one appended.
 With this change in place, the problem has gone, and I can no longer
 cause the problem to occur.
 
 I am a little concerned about what this would mean to a big-endian system
 with a PCI pus and an ex card in it, but I will leave that for someone
 who understands PCI on big-endian systems to contemplate.
 
 I have also caused the "excess collision" case (which is what was
 happening to me, and probably the only realistic way to force this)
 to count as an output error (which it is, on every ethernet chip I've
 ever programmed, and I assume it is here as well) - that enabled me to
 verify that the condition was still occurring (the test case had not
 altered) but is now being correctly handled.   I'd suggest that that
 part of the patch be retained (but I don't think counting excess collisions
 as a single collision makes any sense at all - either don't count, or
 count as 16 (or 14 if the first two are already accounted for) or something.
 But not 1 collision...
 
 While here two other things in elinkxl.c for someone to consider ...
 
 First, the bus_space_read_1() in ex_probemedia() that reads 
 ELINK_W3_RESET_OPTIONS is (according to elin3kreg.h) really reading a
 16 bit register - and is putting the result in a u_int16_t type variable.
 This seems harmless, only the low 8 bits are ever used in elinkxl.c.
 
 And last, the manipulation of the tx_succ_ok variable (field) looks
 bogus to be - doing a test like
 	if (sc->tx_succ_ok < 100)
 along with code that increments the value like
 	sc->tx_succ_ok = (sc->tx_succ_ok+1) & 127;
 makes no sense at - the test cannot tell the difference between 7 and
 135 increments, yet that is clearly what it is wanting to do.
 The increment should probably be pegging the value at 127 (or anything
 bigger than 100) once it reaches that high, as < 100 is all that
 seems to matter, once it is bigger, how much bigger is irrelevant,
 until the value is reset to 0 and it all starts again.
 
 I'll leave that one for someone who understands what is really happening
 to ponder.
 
 kre
 
 ps: the trivial patch is appended - please think about what is really
 happening here, don't just blindly apply it.
 
 Index: elinkxl.c
 ===================================================================
 RCS file: /cvsroot/NetBSD/src/sys/dev/ic/elinkxl.c,v
 retrieving revision 1.83
 diff -u -r1.83 elinkxl.c
 --- elinkxl.c	31 May 2005 01:48:22 -0000	1.83
 +++ elinkxl.c	4 Oct 2005 15:48:01 -0000
 @@ -789,9 +789,10 @@
  	/*
  	 * We need to read+write TX_STATUS until we get a 0 status
  	 * in order to turn off the interrupt flag.
 +	 * ELINK_TXSTATUS is in the upper byte of 2 with ELINK_TIMER
  	 */
 -	while ((i = bus_space_read_1(iot, ioh, ELINK_TXSTATUS)) & TXS_COMPLETE) {
 -		bus_space_write_1(iot, ioh, ELINK_TXSTATUS, 0x0);
 +	while ((i = bus_space_read_2(iot, ioh, ELINK_TIMER)) & TXS_COMPLETE) {
 +		bus_space_write_2(iot, ioh, ELINK_TIMER, 0x0);
  
  		if (i & TXS_JABBER) {
  			++sc->sc_ethercom.ec_if.if_oerrors;
 @@ -813,6 +814,7 @@
  			ex_init(ifp);
  			/* TODO: be more subtle here */
  		} else if (i & TXS_MAX_COLLISION) {
 +			++sc->sc_ethercom.ec_if.if_oerrors;
  			++sc->sc_ethercom.ec_if.if_collisions;
  			bus_space_write_2(iot, ioh, ELINK_COMMAND, TX_ENABLE);
  			sc->sc_ethercom.ec_if.if_flags &= ~IFF_OACTIVE;