Subject: kern/5463: 3c905 driver drops packets on 100Mb network
To: None <gnats-bugs@gnats.netbsd.org>
From: None <abrown@cs.berkeley.edu>
List: netbsd-bugs
Date: 05/18/1998 13:44:39
>Number:         5463
>Category:       kern
>Synopsis:       3c905 driver drops bursts of packets on 100Mb net; breaks UDP
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon May 18 13:50:01 1998
>Last-Modified:
>Originator:     Aaron Brown
>Organization:
	UC Berkeley Computer Science Division
>Release:        NetBSD-1.3.1 (no fix has been made in -current, AFAIK)
>Environment:
i386, Pentium-II/300, Asus P2L97 motherboard,
3Com 3c905 "Boomerang" NIC, 100Mb half-duplex networking.
On bootup, NIC probes as:
	ep0 at pci0 dev 12 function 0: 3Com 3C905 Ethernet
	ep0: MAC address 00:60:08:9f:41:0b
	ep0: 8KB word-wide FIFO, 3:5 Rx:Tx split, mii default mii
	ep0: interrupting at irq 11
ifconfig ep0 reports:
ep0: flags=8867<UP,BROADCAST,DEBUG,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        media: 100baseTX status: active
        inet 128.32.131.182 netmask 0xffffff00 broadcast 128.32.131.255

OS is stock 1.3.1, except that I patched in the "3:5 Rx:Tx split" message
from -current (it previously read "unknown Rx:Tx split").

System: NetBSD pesto.CS.Berkeley.EDU 1.3.1 NetBSD 1.3.1 (PESTO) #2: Thu Apr 23 19:48:41 PDT 1998 abrown@pesto.CS.Berkeley.EDU:/usr/src/sys/arch/i386/compile/PESTO i386
Architecture: i386

>Description:
When attached to a 100Mb network, the 3c905 driver drops the tail end of
packet bursts of roughly 8KB or longer, probably because the receive
FIFO fills up and is not serviced fast enough. This causes several
serious problems. In particular, NFS over UDP doesn't work: it uses
large datagrams, which fragment into many Ethernet packets that the
server sends in a burst, and the last fragment is consistently dropped.
Retries fail as well, since the entire burst is resent. Using TCP mounts
is not an option; the Solaris server used appears incompatible with
BSD's TCP/NFS protocol (the mount hangs after inactivity, even with the
patches in 1.3.1). Similar problems occur occasionally with YP, and TCP
throughput is terrible, probably because the window stays small.
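
As a rough back-of-the-envelope check on the FIFO theory, the sketch below
estimates how much receive buffering the card has and how quickly it could
fill at wire speed. It takes the "8KB" and "3:5 Rx:Tx split" figures from the
probe message at face value; the frame size and the idea that the FIFO is the
only receive buffering are assumptions, and Ethernet preamble, CRC and
inter-frame gap are ignored, so the fill time is a lower bound.

```python
# Rough estimate only; constants marked "assumed" are not from the report.
FIFO_BYTES = 8 * 1024            # "8KB word-wide FIFO" (probe message)
RX_SHARE, TX_SHARE = 3, 5        # "3:5 Rx:Tx split" (probe message)
LINK_BITS_PER_SEC = 100_000_000  # 100Mb Ethernet
FRAME_BYTES = 1514               # assumed: 1500-byte MTU + 14-byte header


def rx_fifo_bytes():
    """Bytes of FIFO available for receive, if the split is a simple ratio."""
    return FIFO_BYTES * RX_SHARE // (RX_SHARE + TX_SHARE)


def fill_time_us(nbytes):
    """Microseconds needed to receive nbytes at full line rate."""
    return nbytes * 8 / LINK_BITS_PER_SEC * 1e6


def full_frames_buffered():
    """How many full-size frames fit in the receive FIFO at once."""
    return rx_fifo_bytes() // FRAME_BYTES


if __name__ == "__main__":
    print("Rx FIFO bytes:", rx_fifo_bytes())
    print("Full-size frames that fit:", full_frames_buffered())
    print("Fill time at line rate (us): %.0f" % fill_time_us(rx_fifo_bytes()))
```

Under these assumptions the receive side holds about 3KB, or two full-size
frames, which would fill in roughly a quarter of a millisecond if the driver
does not drain it. That would be consistent with losing the tail of a
back-to-back burst, but it is only an estimate.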

The problem can be seen in the following portion of tcpdump output,
taken during a "ping -s8192 electra" (note the missing final
fragment in the reply):

13:31:48.185960 pesto.CS.Berkeley.EDU > electra.CS.Berkeley.EDU: icmp: echo request (frag 9642:1480@0+)
13:31:48.186177 pesto.CS.Berkeley.EDU > electra.CS.Berkeley.EDU: (frag 9642:1480@1480+)
13:31:48.186389 pesto.CS.Berkeley.EDU > electra.CS.Berkeley.EDU: (frag 9642:1480@2960+)
13:31:48.186602 pesto.CS.Berkeley.EDU > electra.CS.Berkeley.EDU: (frag 9642:1480@4440+)
13:31:48.186814 pesto.CS.Berkeley.EDU > electra.CS.Berkeley.EDU: (frag 9642:1480@5920+)
13:31:48.187025 pesto.CS.Berkeley.EDU > electra.CS.Berkeley.EDU: (frag 9642:800@7400)
13:31:48.189211 electra.CS.Berkeley.EDU > pesto.CS.Berkeley.EDU: icmp: echo reply (frag 5288:1480@0+)
13:31:48.189393 electra.CS.Berkeley.EDU > pesto.CS.Berkeley.EDU: (frag 5288:1480@1480+)
13:31:48.189568 electra.CS.Berkeley.EDU > pesto.CS.Berkeley.EDU: (frag 5288:1480@2960+)
13:31:48.189742 electra.CS.Berkeley.EDU > pesto.CS.Berkeley.EDU: (frag 5288:1480@4440+)
13:31:48.189923 electra.CS.Berkeley.EDU > pesto.CS.Berkeley.EDU: (frag 5288:1480@5920+)
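
To make the missing fragment explicit, the sketch below reconstructs the
fragment list that a "ping -s8192" datagram should produce, in the same
"length@offset" notation tcpdump uses above. It assumes a 20-byte IP header
with no options and an 8-byte ICMP echo header; the function names are
illustrative only.

```python
# Sketch: expected IP fragments for one ICMP echo datagram.
IP_HEADER = 20    # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header
MTU = 1500        # from the ifconfig output


def expected_fragments(ping_size, mtu=MTU):
    """Return [(length, offset, more_fragments), ...] for one datagram."""
    total = ping_size + ICMP_HEADER        # IP payload to be fragmented
    per_frag = (mtu - IP_HEADER) // 8 * 8  # fragment data is a multiple of 8
    frags = []
    offset = 0
    while offset < total:
        length = min(per_frag, total - offset)
        more = offset + length < total     # "+" in tcpdump output
        frags.append((length, offset, more))
        offset += length
    return frags


def format_fragment(length, offset, more):
    """Format one fragment the way tcpdump prints it, e.g. 1480@0+."""
    return f"{length}@{offset}{'+' if more else ''}"


if __name__ == "__main__":
    for frag in expected_fragments(8192):
        print(format_fragment(*frag))
```

This yields six fragments, the last being 800@7400. The request from pesto
shows all six; the reply from electra stops after 1480@5920+, so the final
800-byte fragment is the one that never arrives.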

With debugging enabled ("ifconfig ep0 debug"), the console shows the
following when the problem occurs:
	ep0: RX overrun
	ep0: packet overrun
	ep0: RX overrun
	ep0: packet overrun
	ep0: packet overrun
	ep0: packet overrun
	<etc>

Note that the problem DOES NOT occur with Red Hat Linux and the v0.99
Linux Vortex driver (the bursts occur, but no packets are dropped),
although it did happen with the stock v0.46C Vortex driver shipped with
Red Hat. This suggests a driver problem (probably not servicing the
FIFO fast enough--maybe DMA would help?) rather than a hardware
problem. I can also reproduce this on 3 other identically-configured
machines attached to the same network via different hubs. If it
matters, the servers that cause the most problematic bursts are
actually on an ATM network which is connected to the 100Mb Ethernet
switch.
>How-To-Repeat:
Install NetBSD-1.3.1 on a machine with a 3c905 NIC. Attach to a 100Mb
network. Run "ping -s16384 <hostname>" for some other fast host on that
network. Watch all packets get dropped.
>Fix:
Reduce the latency of the interrupt service path (the problem gets
worse with more interrupts). Maybe using the card's DMA engine would
help?
>Audit-Trail:
>Unformatted: