Subject: kern/18414: tlp driver can "collect" ~10 packets in 1.6 before sending them
To: None <gnats-bugs@gnats.netbsd.org>
From: None <rauch@math.rice.edu>
List: netbsd-bugs
Date: 09/25/2002 01:26:56
>Number:         18414
>Category:       kern
>Synopsis:       tlp driver can "collect" ~10 packets in 1.6 before sending them
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Sep 25 01:27:00 PDT 2002
>Closed-Date:
>Last-Modified:
>Originator:     Richard Rauch
>Release:        1.6 kernel (1.5 userland)
>Organization:
n/a
>Environment:
NetBSD hermes 1.6 NetBSD 1.6 (hermes) #0: Mon Sep 23 16:14:11 CDT 2002     root@prometheus:/usr/src/sys/arch/i386/compile/hermes i386

(Also happens with GENERIC; the main difference of the custom kernel
is that I increased the SYSV shared memory pages to run Ogle.)
>Description:
First, I have two machines with tlp-based ethernet cards.
One (a full 1.6 install) works normally and uses an older
ethernet card (still a Tulip clone, but an older one).  The other
(1.6 kernel, 1.5.2 userland) is not in a state where I'm ready to
upgrade the whole system yet.  It uses a newer SoHo-ware card,
and is the one having the problems.  (I assume that the 1.5.x userland
wouldn't cause the problem.  But I need/want the 1.6 kernel for
other reasons.)

After booting, network interaction is very poor with my tlp-using
PCI 10/100 card.  At first it looks as if the network card is
simply not working.  But if you start a "ping", say, and let it sit,
then after a few seconds you'll see a burst of "ping" packets
sent/received all at once (with round-trip times separated by ~1 second
because, so far as ping is concerned, the packets were sent 1 second
apart).  With nothing else doing network activity, it takes 9 or 10
"ping" packets to make the interface actually send them.
(So every ~10 seconds, ping will show one with ~9000 ms, one with ~8000
ms, ... one with ~1000 ms, and one with a more realistic round-trip time
in the 0-to-1 range.  I don't know how the 10th packet's round-trip
time compares to normal behavior, but it's at least within an order of
magnitude or so of the normal time.)
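
To make the arithmetic behind that pattern concrete, here is a toy model
of the symptom only.  It is not a description of what the tlp driver
actually does.  The function name and its three parameters (batch size,
ping interval, and wire round-trip time) are assumptions made up for the
illustration; the only inputs taken from the observations above are the
batch of ~10 packets and the 1 second ping interval.

```sh
#!/bin/sh
# Toy model: packets sit in a queue until `batch` of them are waiting,
# then all go out at once.  ping queues one packet every `interval_ms`.
# `wire_ms` is an assumed "normal" round-trip time on the wire.
simulate_batch() {
    batch=$1 interval_ms=$2 wire_ms=$3
    seq=0
    while [ "$seq" -lt "$batch" ]; do
        # Packet `seq` waits until the last packet of the batch arrives.
        wait_ms=$(( (batch - 1 - seq) * interval_ms ))
        echo "icmp_seq=$seq time=$(( wait_ms + wire_ms )) ms"
        seq=$(( seq + 1 ))
    done
}

simulate_batch 10 1000 1
```

With those numbers the first packet of each batch reports about 9000 ms
and the last one reports roughly the normal time, which matches the
staircase seen in the ping output.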

Sorry, I don't have a sample to show of the ping behavior.  Using
workaround (a) (see below), I can eventually kick it into a working
state, and I'm reluctant to reboot it just now and fiddle with it
long enough to make it work again.  (^&  (Maybe this weekend I can
put aside time to have this machine down like that.  Right now, I
am trying to use it.  Email me if more info is required and I'll try
to collect it when I can afford the downtime.)

Doing an "ifconfig" on the interface will cause the ~10 packet queue to
flush even if it is not yet full.

Without *something* to fill/flush the queue, the packets appear to
remain enqueued forever.

dmesg for the card reads:

 /~~~

tlp0 at pci0 dev 13 function 0: Macronix MX98715AEC-x Ethernet, pass 2.5
tlp0: interrupting at irq 10
tlp0: Ethernet address 00:80:c6:f9:bc:35
tlp0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
 
 \___

/etc/ifconfig.tlp0 reads:

 /~~~

media 100baseTX
inet hermes netmask 255.255.255.0

 \___

>How-To-Repeat:
Use this ethernet card in this machine with a 1.6 kernel (I assume
userland doesn't matter).  Boot.  Try to use the network.

>Fix:
I don't know a proper fix at this time.

I can offer some workarounds that may help others who encounter this
problem, though:

(a) Cycling the card with ifconfig down/up seems to eventually kick it
out of this problem, and then it works normally.  (Continuing to
ifconfig cycle it may restore the problem; I haven't gone there.)
I do NOT see a pattern to this; I may have to ifconfig cycle the
interface many times before it starts working.  But once it works, I
can leave it alone and the network just works.

(b) ping -f, or similar, can send lots of small packets and hence
reduce the latency on the packets considerably.  (Highly interactive
stuff, especially with small packets, may still suffer profoundly,
but at least the queue won't stall forever.)

(c) Similar to (b), one could do "ifconfig tlp0 >/dev/null" in a tight
loop, since the ifconfig probe seems to flush the queue.

(d) Get a different card.  (^&
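
For anyone who wants to try (a) or (c), the following is a minimal
sketch, not a tested fix.  The function names, the IFCONFIG override
variable, and the count/pause arguments are all made up for the
illustration; only "ifconfig tlp0", "down" and "up" come from the
workarounds above.  Cycling the interface needs root.

```sh
#!/bin/sh
# IFCONFIG can be overridden (e.g. for a dry run); defaults to ifconfig.
: "${IFCONFIG:=ifconfig}"

# Workaround (a): one down/up cycle of the interface.
cycle_interface() {
    "$IFCONFIG" "$1" down && "$IFCONFIG" "$1" up
}

# Workaround (c): probe the interface `count` times, since the probe
# seems to flush the queue.  `pause` is whole seconds between probes;
# a pause of 0 gives the tight loop described in (c).
poke_interface() {
    ifname=$1 count=$2 pause=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Only the side effect of the probe matters, not its output.
        "$IFCONFIG" "$ifname" >/dev/null || return 1
        i=$(( i + 1 ))
        [ "$pause" = 0 ] || sleep "$pause"
    done
}

# Example (tlp0 is the interface from this report):
#   cycle_interface tlp0
#   poke_interface tlp0 1000 0
```

The loop is bounded rather than infinite so that it can be run from a
terminal and stops by itself; raise the count, or wrap the call in
"while :; do ...; done", to keep it going.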

>Release-Note:
>Audit-Trail:
>Unformatted: