Subject: Re: Strange network hang on Poweredge 860
To: Lars Friend <lfriend@mcci.com>
From: Chuck Swiger <cswiger@mac.com>
List: netbsd-help
Date: 09/10/2007 12:52:41
Hi, Lars--

On Sep 10, 2007, at 11:34 AM, Lars Friend wrote:
> Hello all,
>         I've been experiencing a very strange mode of failure which has
> me scratching my head so I figured I'd ask here to see if anybody had
> seen something like this before.
>
>         I have installed NetBSD 3.1 on a brand new Dell PowerEdge 860
> system (dual core P4 Xeon, 4GB ram, 2 SATA drives in software RAID
> using raidframe raid1).
[ ... ]
>         So, we replaced the old system with our fancy new one, and four
> hours into operation, things get weird.  The system is still running,
> everything seems okay, nothing unexpected or unpleasant in syslog, but
> the NIC is kaput.  It sees link, seems to be okay, but it won't accept
> or make connections, pings, or any other network traffic.
[ ... ]
>         Has anybody seen this before, or does anybody have a good hunch
> about what I can do to duplicate the failure?  Once I can duplicate it
> "in captivity" it will be easier to debug, and easier to correct, but I
> would love to be able to duplicate it without putting it up live and
> letting it crash because that is not only a lot of work, but it
> inconveniences users who need to use the system.

There were a number of problems with the Broadcom NICs in Dell  
machines reported on the FreeBSD lists, particularly in conjunction  
with heavy UDP traffic such as NFS using the default transport.  It  
seems like the NIC would get confused about the state of the transmit  
and receive buffers (some kind of refcounting problem?), and stop  
passing traffic entirely, which sounds similar to the problem you've  
reported.
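
If you want to try provoking the hang on a test network rather than in
production, something along these lines might generate the kind of
sustained UDP load described above.  This is only a rough sketch in
Python, not something taken from the FreeBSD threads: the function name
send_udp_burst, the datagram size, and the target address and port are
all my own placeholders, and there is no guarantee that this traffic
pattern actually triggers the bug.

```python
import socket


def send_udp_burst(host, port, count=1000, size=1400):
    """Send `count` UDP datagrams of `size` bytes to (host, port).

    Returns the number of datagrams the kernel accepted for sending.
    Hypothetical helper for load testing; not part of any driver tooling.
    """
    payload = b"\x00" * size
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            try:
                sock.sendto(payload, (host, port))
                sent += 1
            except OSError:
                # For example ENOBUFS when the send queue is full:
                # skip this datagram and keep going.
                continue
    return sent


# Example (placeholder address from the documentation range, discard port):
#   send_udp_burst("192.0.2.10", 9, count=1_000_000)
```

Run it from a second machine against the suspect box (and ideally in
the other direction as well), then watch whether the interface stops
passing traffic while still showing link.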

There were also some initialization issues which tended to occur if  
the NIC needed to be reset/woken up after entering an ACPI sleep  
state, doing WOL, or similar.  One of Broadcom's engineers, David
Christensen <davidch@broadcom.com>, has done work to fix them and to
improve the diagnostic messages so that better information is  
reported when the adaptor gets confused.

You might find the threads here:

   http://lists.freebsd.org/pipermail/freebsd-net/2007-June/thread.html

...such as "Problems with BCE network adapter (Dell PE2950)" to  
contain some helpful info and code patches.  It seems like the  
OpenBSD folks have also implemented some fixes and workarounds for  
PHY bugs in the BCM 575x/578x chipsets, going by:

   http://leaf.dragonflybsd.org/mailarchive/commits/2007-05/msg00036.html

Perhaps someone more familiar with the status of the BCM driver in  
NetBSD could offer more detailed information than I can, but at least  
you've got a starting point and the name of a Broadcom engineer who
has worked on their BSD drivers.

Regards,
-- 
-Chuck

PS: I wouldn't swap in a RealTek NIC given a choice-- the newer NICs
from them aren't bad, but the older ones seemed to be flaky as well;
instead I'd try an Intel EtherExpress PRO/100 ("fxp" to me, and I
believe NetBSD calls 'em "fxp" as well; "wm" is the gigabit PRO/1000),
or the DEC "tulip" 21x4x chips ("dc" or "de" to me, probably "tlp" or
"de" on NetBSD)....