Subject: tlp0 weirdness.
To: None <netbsd-help@netbsd.org>
From: Richard Rauch <rkr@olib.org>
List: netbsd-help
Date: 04/17/2003 20:33:28
I've noticed something lately about one of my ethernet cards, a tlp-using
card.  Sometimes, after a reboot (which is rare enough that I've taken
some time to gather info), I will get lots of errors with the card when
moving data fast.  E.g., ssh over to my mail server and cat my mailbox.

The errors are of this form:

 /~~~ errors

tlp0: receive error: CRC error
tlp0: receive error: MII error
tlp0: receive error: dribbling bit
tlp0: receive error: CRC error

 \___ errors

...with the "dribbling bit" coming at intermittent points.
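
To put numbers on that mix, something like the following Python sketch
could tally the errors by type from saved dmesg output.  The function
name and the sample text are made up for illustration, and the pattern
assumes the messages look exactly like the ones quoted above.

```python
import re
from collections import Counter


def tally_receive_errors(log_text, ifname="tlp0"):
    """Count '<ifname>: receive error: <kind>' lines, grouped by kind."""
    # Assumption: one message per line, formatted as in the sample above.
    pattern = re.compile(
        r"^%s: receive error: (.+?)\s*$" % re.escape(ifname),
        re.MULTILINE,
    )
    return Counter(pattern.findall(log_text))


# Sample input copied from the error listing above.
sample = """\
tlp0: receive error: CRC error
tlp0: receive error: MII error
tlp0: receive error: dribbling bit
tlp0: receive error: CRC error
"""

counts = tally_receive_errors(sample)
for kind, n in counts.most_common():
    print(kind, n)
```

Feeding it the real dmesg text instead of the sample should show
whether CRC errors dominate or whether the three kinds come in a
fixed ratio.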

At first, I thought that it was 1.6 being flaky, or maybe a
card was damaged when I shuffled cards around, etc.  Or even a
bad PCI slot.  However, on some reboots, the problem went away.
This was without any physical contact to the card: It would sometimes
just decide to behave, and other times not.

I should also note that I have another tlp-using card in another
machine.  It does not experience these problems.  (I haven't
tried swapping the cards to see if it could be motherboard/chipset
weirdness.  The *other* machine is under a 19" CRT and I am
disinclined to get into that box, due to lack of a good place to
put the monitor and difficulty getting it re-seated on its swivel/tilt
base.)


Today, I tried something on a hunch, since I had recently rebooted
it and it was "resisting" booting into a "happy" state.  Instead of
rebooting it N times until it happened to be happy, I did
"ifconfig tlp0 down" followed by "ifconfig tlp0 up".  Doing it once
didn't make any difference.  A second time, though, caused something
to click and suddenly it was happy.

Now this provides some more real information, and is also something
that I can repeat (I assume that taking it down and back up will
roll the dice and possibly make it misbehave again).
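
If it helps anyone reproduce this, the down/up cycling could be
automated along these lines.  This is only a sketch: the two ifconfig
invocations are the ones quoted above, but bounce_until_happy, the
is_happy callback and the settle delay are hypothetical names of my
own, and deciding what counts as "happy" (e.g. no new receive errors
after a test transfer) is left to the caller.  It needs root to run
for real.

```python
import subprocess
import time


def bounce(ifname, run=subprocess.run):
    """Take the interface down and bring it back up, once."""
    run(["ifconfig", ifname, "down"], check=True)
    run(["ifconfig", ifname, "up"], check=True)


def bounce_until_happy(ifname, is_happy, max_tries=10, settle=2.0,
                       run=subprocess.run):
    """Cycle the interface until is_happy() returns True.

    is_happy is a caller-supplied (hypothetical) check.  Returns the
    number of cycles it took, or None if max_tries was exhausted.
    """
    for attempt in range(1, max_tries + 1):
        bounce(ifname, run=run)
        time.sleep(settle)  # assumed pause to let the link settle
        if is_happy():
            return attempt
    return None
```

The number of cycles needed, logged over several runs, would show
whether the card really lands in the good state at random.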

I never had such problems under 1.5, as I recall.


The misbehaving card is probed as:

 /~~~ probe

tlp0 at pci0 dev 13 function 0: Macronix MX98715AEC-x Ethernet, pass 2.5
tlp0: interrupting at irq 10
tlp0: Ethernet address 00:80:c6:f9:bc:35
tlp0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

 \___ probe

...and is configured:

 /~~~ ifconfig

inet phermes

 \___ ifconfig

(phermes is a private 10.* address.)

The kernel is:

NetBSD hermes 1.6 NetBSD 1.6 (hermes) #0: Sun Dec  1 00:03:04 CST 2002     root@hermes:/usr/src/sys/arch/i386/compile/hermes i386

(Generic NetBSD/i386 1.6 sources with a custom config.)

This misbehaving card is in an Athlon box with a VIA chipset that has
generally given me little grief.  (The sound chips don't record, but they
play back nicely; the box seems physically stable.  I've had it for about
3 years.)


The "happy" tlp card probes:

 /~~~ probe

tlp0 at pci0 dev 16 function 0: Lite-On 82C169 Ethernet, pass 2.0
tlp0: interrupting at irq 11
tlp0: Ethernet address 00:a0:cc:23:a9:b6
ukphy1 at tlp0 phy 1: Generic IEEE 802.3u media interface

 \___ probe

(Sidenote: Does it matter that this one gets a PHY, while the other didn't?)

...ifconfig and uname are essentially the same, mod a different hostname.

...this "happy" card is in an old PII Gateway 2000 machine.  I've had
it for over 5 years and only a dead monitor and dead CD-ROM drive have
ever been a problem.  It's now a headless mailserver and backup/slave DNS.


Some additional info (all applying to when the card is misbehaving):

If I have an X server up and do these things from X, I see very noticeable
stop-and-go behavior when using cat remotely on a large text file into an
xterm.  I can do at least the cat from a console window on a 150K text
file with no noticeable lag (though it appears to still stuff errors into
dmesg's buffer).  Under X, the font size seems to have a small, but
significant impact on the errors.  I still get errors even if I
pipe ssh's output into, say, "wc", so while displaying the output
is significant, it is not essential to the problem.

A certain amount of data (maybe around 5K; it is not a fixed sum)
can be sent without errors.  It is not clear if it is just a random
(or semi-regular, long-orbiting) strike, or if it has to do with some
buffer overflowing.

When dealing with large enough data to induce these errors, there
are approx. 1-second pauses scattered at (varying) locations in
the stream.  This greatly reduces throughput.
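
To pin down where those pauses fall, the stream could be timed as it
arrives.  A rough sketch, assuming the data is read from a binary
stream in fixed-size chunks; timed_chunks, find_stalls, the 4096-byte
chunk size and the 0.5-second threshold are all arbitrary choices of
mine, not anything measured.

```python
import time


def timed_chunks(stream, chunk_size=4096):
    """Yield (arrival_time, byte_offset) after each chunk is read."""
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        offset += len(chunk)
        yield time.monotonic(), offset


def find_stalls(events, threshold=0.5):
    """Return (offset_before_gap, gap_seconds) for each long gap.

    events is an iterable of (time, byte_offset) pairs in arrival order.
    """
    stalls = []
    prev_time = prev_offset = None
    for when, offset in events:
        if prev_time is not None and when - prev_time > threshold:
            stalls.append((prev_offset, when - prev_time))
        prev_time, prev_offset = when, offset
    return stalls
```

Something like find_stalls(timed_chunks(sys.stdin.buffer)), with the
ssh output piped in, would list the byte offsets at which each pause
begins, which should help tell a buffer filling up at a regular
interval from a random strike.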


Any ideas?  Or does this information shed light on bugs that others
have been seeing?


-- 
  "I probably don't know what I'm talking about."  --rkr@olib.org