Subject: Re: fxp bug triggering hung vnlock()'ed NFS client
To: Artem Belevich <art@riverstonenet.com>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 11/05/2003 18:33:03
[[ moving from port-alpha to tech-net (and tech-kern) because this is
   definitely not architecture specific ]]

[ On Tuesday, November 4, 2003 at 18:06:55 (-0800), Artem Belevich wrote: ]
> Subject: Re: How can I help with hung vnlock()'ed clients?
>
> The problem usually happens on the box with Intel's i82559 NIC
> (if_fxp). The NIC in question occasionally gets stuck for about a
> minute, prints "fxp0: device timeout" on the console and starts
> working again. This may or may not have something to do with the
> problem.

Hey!  I was just going to post about that.  I was wondering if anyone
else was having problems with timeouts on fxp interfaces with i82559
chips.

I've now been having repeatable problems with the on-board fxp interface
of an Intel STL/2 (PIII) motherboard.

	fxp0 at pci0 dev 3 function 0: i82559 Ethernet, rev 8
	fxp0: interrupting at irq 10
	fxp0: Ethernet address 00:d0:b7:b6:ad:4b
	inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
	inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

I usually see the following first, sometimes several times in a row:

	fxp0: WARNING: SCB timed out!

Then, sporadically intermixed, there will be some lines of:

	fxp0: device timeout

and finally when it gets "permanently" stuck (to all traffic, not just
NFS traffic), it'll say this and be silent from then until I reboot:

	fxp0 at line 2028: dmasync timeout
	fxp0 at line 1620: dmasync timeout

I've tried with and without the "link0" feature turned on.  (I can't
remember at the moment if the last two warnings print without link0 set.)
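
For comparing notes between machines, the progression above (SCB
timeouts, then device timeouts, then dmasync timeouts) can be tallied
from a saved console or syslog capture.  What follows is only a minimal
sketch in Python: the function name and the category labels are
hypothetical, and the only things taken from the driver are the message
texts quoted above.  It assumes each message sits on one line, possibly
preceded by a syslog timestamp.

```python
import re
from collections import Counter

# Message texts are copied from the console output quoted in this post.
# The labels on the left are illustrative names, not driver terminology.
PATTERNS = [
    ("scb_timeout", re.compile(r"\b(fxp\d+): WARNING: SCB timed out!")),
    ("device_timeout", re.compile(r"\b(fxp\d+): device timeout")),
    ("dmasync_timeout", re.compile(r"\b(fxp\d+) at line \d+: dmasync timeout")),
]


def tally_fxp_messages(lines):
    """Count fxp timeout messages per (interface, category) pair."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[(match.group(1), label)] += 1
                break  # a line is counted in at most one category
    return counts


if __name__ == "__main__":
    sample = [
        "fxp0: WARNING: SCB timed out!",
        "fxp0: device timeout",
        "fxp0 at line 2028: dmasync timeout",
        "fxp0 at line 1620: dmasync timeout",
    ]
    for (ifname, label), n in sorted(tally_fxp_messages(sample).items()):
        print(ifname, label, n)
```

Running the same tally with and without link0 set, or in each media
mode, would at least show whether the mix of messages changes.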

Of course a reboot doesn't work when an NFS mount is stuck either -- I
have to <Ctrl-Alt-ESC> into DDB and reboot manually again (or hit the
reset :-).


For me it happens regularly with even light NFS usage, and that can be
as easy as copying a single multi-megabyte file from the NFS server.

I'm pretty sure this is a recently introduced fxp(4) driver bug, not a
problem with the i82559, though I have not yet tested all the
combinations of media modes and I've not yet tested i82550 NICs.

I've got several of these machines in use with an older version of the
driver (circa 2001/06/24), and they've never ever complained of an SCB
timeout.  This particular machine ran as a very heavily used Squid
server for many months without fail.

However I've been testing it in 10baseT mode with 1.6.x (it was used in
100baseTX-FDX mode in production with the older driver).

I should probably try it in 10baseT-FDX mode as my switch does support
full duplex.

I also need to find a long enough cross-over cable and try it at
100baseTX-FDX, though my switch only has a pair of 100baseTX ports (that
need x-over cables), and they're both in use so I'll have to move one of
the production systems down to 10baseT first in order to do that test.

If it still fails I should also try the add-on fxp that this machine
happens to have (in all three modes, I suppose):

	fxp1 at pci0 dev 7 function 0: i82550 Ethernet, rev 12
	fxp1: interrupting at irq 9
	fxp1: Ethernet address 00:02:b3:28:1e:ac
	inphy1 at fxp1 phy 1: i82555 10/100 media interface, rev. 4
	inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

I have a sneaking suspicion now though that the media mode is not at all
related to the problem (I was hoping it was), though perhaps the i82559
is what's buggy...  What revision chips do you have problems with?

> Good news is that sometimes everything recovers if you do 'ls' on the
> filesystem that's mounted from the same place that locked file comes
> from. Chances are 50/50. The bad news is that if you're unlucky,
> you'll get another stuck process. :-(

It never recovers for long for me.  Usually within minutes the whole
interface is frozen solid and not even pingable.

There are two problems here.

One is the stuck NFS mount, and the other of course is the fxp bug that
triggers the NFS bug.  :-)

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>