Subject: Re: sk(4) gigabit on 10/100 hangs
To: Jan Schaumann <jschauma@netmeister.org>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-net
Date: 12/07/2004 04:10:43
    Date:        Fri, 3 Dec 2004 14:09:25 -0500
    From:        Jan Schaumann <jschauma@netmeister.org>
    Message-ID:  <20041203190925.GB19261@netmeister.org>

  | I just filed PR 28517 -- just wondering if anybody here has seen a
  | similar problem and if so what people are doing about it.  For those too
  | lazy to look at the pr, it's the problem that the sk(4) when connected
  | to a 10/100 switch will hang when sending (more than negligible) data.

I have a sk (on an ASUS P4P800) connected to a 10/100 switch.

I think the driver certainly still needs some work, but I don't see quite
the problem that you do.

For me, I sometimes have to boot a couple of times to get it working (I
sometimes get "sk0 watchdog" kernel messages at startup - while dhclient
is attempting to get an address - so basically 0 traffic load - and once
that happens, nothing seems able to make the transmit path work - receive
looks OK though).

On other boots there are no watchdog timeout errors, but still no transmit
path.  At that point, an "ifconfig sk0 media auto" (which I think achieves
the same thing as the "ifconfig sk0 down; ifconfig sk0 up" that others have
reported using - but with less typing needed) will usually help it along
enough that things recover.
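
For anyone who wants to script the workaround, here is a minimal sketch
of the two variants mentioned above.  The function name kick_interface
and the IFCONFIG override variable are made up for illustration; only the
ifconfig invocations themselves come from this thread.  It is a sketch,
not a tested fix.

```shell
#!/bin/sh
# Sketch only: re-trigger media selection on an interface whose transmit
# path appears stuck.  IFCONFIG is a hypothetical override so the function
# can be dry-run (e.g. IFCONFIG=echo) without touching a real interface.
IFCONFIG=${IFCONFIG:-ifconfig}

# kick_interface IFNAME [MODE]
#   MODE "media" (default): ifconfig IFNAME media auto
#   MODE "updown":          ifconfig IFNAME down, then ifconfig IFNAME up
kick_interface() {
    ifname=$1
    mode=${2:-media}
    case $mode in
        media)  $IFCONFIG "$ifname" media auto ;;
        updown) $IFCONFIG "$ifname" down && $IFCONFIG "$ifname" up ;;
        *)      echo "kick_interface: unknown mode '$mode'" >&2
                return 2 ;;
    esac
}

# Example (as root):  kick_interface sk0
```

Running it with IFCONFIG=echo just prints the arguments that would be
passed, which is a cheap way to check the script before using it for real.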

For me, once working, it is stable (I do sometimes make it work pretty
hard - well, hard on a cheap 100 Mbps switch port talking to rather less
speedy systems).   I have seen other occasional later watchdog timer
messages, but they at least appear harmless.

I believe the problem I see may be related to attempting to transmit while
the phy is attempting media negotiation.   Or perhaps just in some particular
states of that (which may be why sometimes it hangs completely, and other
times is recoverable).

It may also depend upon the precise nature of the switch in use.

For me, it takes a second or two to settle on 100FDX as the media rate.
During this time, dhclient (and rtsol) will have attempted to
transmit some packets.

My guess is that when this happens, something in the chip locks up, and
it transmits no more (most likely, stops generating transmit done
interrupts).   Obviously the packets that attempted to transmit when the
switch still hadn't settled on a data rate go nowhere useful.

I'm also assuming that the up/down (or media auto) sequence works by
resetting the chip (NetBSD does far too much NIC chip resetting in
general, but here it is useful) - then as long as no packet was being
transmitted during the new link rate negotiation that occurs, everything
fixes itself.   For me, at least, my "media auto" generally happens during
one of the 10-20 second pauses that dhclient falls into, and the rtsol
attempt only happens once, so generally the network is just dead at that
point and no-one is attempting to send, which is all good.   On the other
hand, during the normal boot sequence, it is essentially guaranteed that
there will be several packets to send during the phy/switch negotiation
sequence.
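
If that guess is right, one way to test it would be to hold off anything
that transmits until the link has settled.  The following is only a
sketch under assumptions: wait_for_link and the IFCONFIG override are
hypothetical names, and it assumes that ifconfig prints a
"status: active" line once negotiation has finished, which may not
coincide exactly with the chip being safe to transmit on.

```shell
#!/bin/sh
# Sketch only: poll an interface until ifconfig reports an active link.
# IFCONFIG is a hypothetical override used for dry runs and testing.
IFCONFIG=${IFCONFIG:-ifconfig}

# wait_for_link IFNAME [TRIES]
#   Checks up to TRIES times (default 10), one second apart.
#   Returns 0 as soon as "status: active" is seen, 1 otherwise.
wait_for_link() {
    ifname=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        if $IFCONFIG "$ifname" 2>/dev/null | grep -q 'status: active'; then
            return 0
        fi
        tries=$((tries - 1))
        # Only sleep if another attempt is still to come.
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Example (as root):  wait_for_link sk0 10 && dhclient sk0
```

If the hang still shows up with the first transmit delayed this way, the
negotiation theory is probably wrong, or at least incomplete.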

kre

ps: I should point out that I'm running a slightly old -current on this
system - and I know that a sk driver fix has been committed more recently
than the code I'm running.    I keep trying to run more recent kernels to
see if that fix helped the problem I see, but I keep getting uvm related
panics in (fairly) recent -current kernels I've tried (one just as fsck
started running for example) so I haven't managed to do any kind of
real test on that yet - which is one reason I haven't mentioned any of
this before.   My most recent -current test was a week or so ago though;
I'll try another soon.