Subject: Question about TCP lockups with Ethernet
To: None <current-users@NetBSD.ORG>
From: Brian Buhrow <buhrow@cats.ucsc.edu>
List: current-users
Date: 03/04/1995 21:07:23
	Before I describe my problem, let me explain that, although I am
running a very old vintage of NetBSD 0.9A (early February 1994) on an i386
box, I don't believe my trouble is specific to the i386 or to NetBSD 0.9A:
I've seen reports of similar behavior under NetBSD 1.0, with no real
explanation of what is going on.  With that out of the way, here is my
trouble.

	I use my NetBSD box by logging in from an Annex terminal server and
then logging in, on an as-needed basis, to various machines around our
network.  If, while telnetted from the NetBSD box to another machine (it
can be anything: a Sun, an SGI, a router, etc.), I send a screenful or more
of data back to myself from the remote host, my connection freezes.  That
is, the connection between the terminal server and the NetBSD box freezes.
If I log into the NetBSD box again, I find that the send queue from the
NetBSD box to the Annex terminal server is very full and that the Ethernet
driver has had to reset the Ethernet controller due to a timeout error.
The errors look like this:

Mar  4 18:17:11 baloo vmunix: ed2: remote transmit DMA failed to complete
Mar  4 18:24:41 baloo vmunix: ed2: remote transmit DMA failed to complete
Mar  4 18:28:02 baloo vmunix: ed2: remote transmit DMA failed to complete
Mar  4 18:28:05 baloo vmunix: ed2: remote transmit DMA failed to complete

(Don't mind the vmunix; it comes from the modified version of syslogd we
run around here.)
	Sometimes the above messages are accompanied by messages about invalid
packet lengths.

The TCP queue might look like:

Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0   1570  baloo.UCSC.EDU.telnet  annex-test.UCSC..10321 ESTABLISHED

	I understand that a hardware problem with the Ethernet controller is
probably at the bottom of this, but what keeps the TCP retransmit timer
from timing out and retrying after the initial attempt fails?  The Ethernet
card was reset, other connections continue to work, UDP traffic is
unaffected, and the machine needs no intervention at all.  Am I mistaken in
believing that TCP is supposed to keep retrying transmissions, with an
increasing backoff timer, until the connection timeout timer expires?  It is
as if the TCP layer was told that the packet had been transmitted and saw
no reason to retransmit it.  Thus, the connection timer timed out and the
connection was ultimately dropped.
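
	To make the behavior I expected concrete, here is a minimal sketch of
an exponentially backed-off retransmit schedule.  The function name and the
default values (a 1-second initial timeout, a 64-second cap, and 12 retries)
are assumptions chosen for illustration; I have not checked them against the
NetBSD sources.

```python
def retransmit_schedule(initial_rto=1.0, max_rto=64.0, max_retries=12):
    """Return the wait (in seconds) before each successive retransmission.

    All default values are illustrative assumptions, not values read
    from any particular TCP implementation.
    """
    waits = []
    rto = initial_rto
    for _ in range(max_retries):
        # Each unacknowledged attempt waits up to max_rto seconds...
        waits.append(min(rto, max_rto))
        # ...and the timeout doubles before the next attempt.
        rto *= 2
    return waits


if __name__ == "__main__":
    schedule = retransmit_schedule()
    print("waits between retries:", schedule)
    print("total time before giving up:", sum(schedule), "seconds")
```

	Under these assumed values, the segment would be resent a dozen times
over several minutes before the connection was abandoned, which is not what
I observe: the queued data appears never to be resent at all.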
	I have seen others report similar problems with NetBSD 1.0 and PPP.
I am willing to look into the problem if someone could give me a clue as to
how to trace a packet through the layers, from the Ethernet driver up to
TCP.  What files should I look at?  What books should I read?

Any suggestions, points of insight, etc. would be most appreciated.
-thanks
-Brian

P.S.  For the curious, I am running:
80486DX-20 (8 MB memory) with an NE2000 clone (if_ed driver).