netbsd-help: tlp0 weirdness.

Subject: tlp0 weirdness.
To: None <netbsd-help@netbsd.org>
From: Richard Rauch <rkr@olib.org>
List: netbsd-help
Date: 04/17/2003 20:33:28

I've noticed something lately about one of my ethernet cards, a tlp-using
card. Sometimes, after a reboot (which is rare enough that I've taken
some time to gather info), I will get lots of errors with the card when
moving data fast. E.g., ssh over to my mail server and cat my mailbox.

The errors are of this form:

/~~~ errors

tlp0: receive error: CRC error
tlp0: receive error: MII error
tlp0: receive error: dribbling bit
tlp0: receive error: CRC error

\___ errors

...with the "dribbling bit" coming at intermittent points.
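
To see how the error types are distributed, the messages could be tallied
with something like the following. This is only a sketch: tally_rx_errors
is a made-up helper name, and it assumes the kernel messages look exactly
like the lines quoted above ("<interface>: receive error: <type>").

```shell
# Tally receive errors by type from kernel-message text on stdin.
# tally_rx_errors is a hypothetical helper, not a NetBSD tool.
# $1 is the interface name (defaults to tlp0).
tally_rx_errors() {
    awk -v iface="${1:-tlp0}" '
        BEGIN { prefix = iface ": receive error: " }
        # Only count lines that start with the expected prefix.
        index($0, prefix) == 1 { count[substr($0, length(prefix) + 1)]++ }
        END { for (type in count) printf "%d %s\n", count[type], type }
    ' | sort -rn
}
# Usage: dmesg | tally_rx_errors tlp0
```

The most frequent error type is printed first. Since the dmesg buffer is
finite, old messages can scroll out of it, so the counts are a lower bound.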

At first, I thought that it was 1.6 being flakey, or maybe a
card was damaged when I shuffled cards around, etc. Or even a
bad PCI slot. However, on some reboots, the problem went away.
This was without any physical contact to the card: It would sometimes
just decide to behave, and other times not.

I should also note that I have another tlp-using card in another
machine. It does not experience these problems. (I haven't
tried swapping the cards to see if it could be motherboard/chipset
weirdness. The *other* machine is under a 19" CRT and I am
disinclined to get into that box, due to lack of a good place to
put the monitor and difficulty getting it re-seated on its swivel/tilt
base.)

Today, I tried something on a hunch, since I had recently rebooted
it and it was "resisting" booting into a "happy" state. Instead of
rebooting it N times until it happened to be happy, I did
"ifconfig tlp0 down" followed by "ifconfig tlp0 up". Doing it once
didn't make any difference. A second time, though, caused something
to click and suddenly it was happy.

Now this provides some more real information, and is also something
that I can repeat (I assume that taking it down and back up will
roll the dice and possibly make it misbehave again).
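
For repeating the experiment, the down/up cycle could be wrapped in a small
function. This is a sketch under assumptions: cycle_iface and the IFCONFIG
override variable are names invented here, the one-second default delay is
a guess, and the real commands need to be run as root.

```shell
# Take an interface down and back up a given number of times.
# cycle_iface and IFCONFIG are hypothetical names for this sketch.
# $1 = interface, $2 = number of cycles (default 1),
# $3 = seconds to wait after each step (default 1, an arbitrary choice).
cycle_iface() {
    iface=$1
    times=${2:-1}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$times" ]; do
        ${IFCONFIG:-ifconfig} "$iface" down || return 1
        sleep "$delay"
        ${IFCONFIG:-ifconfig} "$iface" up || return 1
        sleep "$delay"
        i=$((i + 1))
    done
}
# Usage (as root): cycle_iface tlp0 2
```

Setting IFCONFIG=echo prints the commands instead of running them, which
is a harmless way to check what the function would do.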

I never had such problems under 1.5, as I recall.

The misbehaving card is probed as:

/~~~ probe

tlp0 at pci0 dev 13 function 0: Macronix MX98715AEC-x Ethernet, pass 2.5
tlp0: interrupting at irq 10
tlp0: Ethernet address 00:80:c6:f9:bc:35
tlp0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

\___ probe

...and is configured:

/~~~ ifconfig

inet phermes

\___ ifconfig

(phermes is a private 10.* address.)

The kernel is:

NetBSD hermes 1.6 NetBSD 1.6 (hermes) #0: Sun Dec 1 00:03:04 CST 2002 root@hermes:/usr/src/sys/arch/i386/compile/hermes i386

(Generic NetBSD/i386 1.6 sources with a custom config.)

This misbehaving card is in an Athlon box with a VIA chipset that has
generally given me little grief. (The sound chips don't record, but they
play back nicely; the box seems physically stable. I've had it for about
3 years.)

The "happy" tlp card probes:

/~~~ probe

tlp0 at pci0 dev 16 function 0: Lite-On 82C169 Ethernet, pass 2.0
tlp0: interrupting at irq 11
tlp0: Ethernet address 00:a0:cc:23:a9:b6
ukphy1 at tlp0 phy 1: Generic IEEE 802.3u media interface

\___ probe

(Sidenote: Does it matter that this one gets a PHY, while the other didn't?)

...ifconfig and uname are essentially the same, mod a different hostname.

...this "happy" card is in an old PII Gateway 2000 machine. I've had
it for over 5 years and only a dead monitor and dead CD-ROM drive have
ever been a problem. It's now a headless mailserver and backup/slave DNS.

Some additional info (all applying to when the card is misbehaving):

If I have an X server up and do these things from X, I see very noticeable
stop-and-go behavior when using cat remotely on a large text file into an
xterm. I can do at least the cat from a console window on a 150K text
file with no noticeable lag (though it appears to still stuff errors into
dmesg's buffer). Under X, the font size seems to have a small, but
significant impact on the errors. I still get errors even if I
pipe ssh's output into, say, "wc", so while displaying the output
is significant, it is not essential to the problem.

A certain amount of data (maybe around 5K; it is not a fixed sum)
can be sent without errors. It is not clear if it is just a random
(or semi-regular, long-orbiting) strike, or if it has to do with some
buffer overflowing.

When dealing with large enough data to induce these errors, there
are approx. 1-second pauses scattered at (varying) locations in
the stream. This greatly reduces throughput.
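
To put a rough number on the throughput loss, the stream could be piped
through a byte counter with a timer around it. This is a sketch only:
measure_stream is a made-up name, "mailhost" and "mailbox" in the usage line
are placeholders, and the timing has one-second resolution, so it is only
meaningful for transfers that take several seconds.

```shell
# Count the bytes arriving on stdin and report an approximate rate.
# measure_stream is a hypothetical helper for this sketch.
measure_stream() {
    start=$(date +%s)
    bytes=$(wc -c | tr -d ' ')      # strip the padding some wc versions add
    end=$(date +%s)
    secs=$((end - start))
    # Clamp to one second so very short transfers do not divide by zero.
    [ "$secs" -gt 0 ] || secs=1
    echo "$bytes bytes in ${secs}s, about $((bytes / secs)) bytes/s"
}
# Usage: ssh mailhost cat mailbox | measure_stream
```

Comparing the figure from a "happy" boot against a misbehaving one would
show how much the pauses actually cost.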

Any ideas? Or does this information shed light on bugs that others
have been seeing?

--
"I probably don't know what I'm talking about." --rkr@olib.org