tech-net archive


Re: [Greg Troxel] Dell R610 lockups?



In article <rmi39u3sell.fsf%fnord.ir.bbn.com@localhost>,
Greg Troxel  <gdt%ir.bbn.com@localhost> wrote:
>It's not clear if this is an amd64 issue or a network code issue, so
>pointing it out here.
>
>Some colleagues at BBN have several Dell R610s, purchased fairly
>recently.  They've been experiencing total hangs, from which they can
>recover only with the power button (hold 4s).  Before a hang, ctrl-alt-esc
>drops into DDB, but after the hang ctrl-alt-esc does nothing.  The Dell boxes
>are pretty normal, with single SATA disks, 4 on-board bnx and a 4-port
>wm.
>
>The lockup happens with a netbsd-5 (RC3 I think) install CD for amd64
>after doing an install and running from disk.  They haven't tried i386.
>
>It's not exactly clear what triggers the hang, but it seems to be
>network traffic, with ping (sourcing and sinking) being worse than
>forwarding.  A fairly reliable way to hose the machines is to hook up a
>cat5 between two of them, ifconfig some addresses, and ping -f across
>that.  RTT is an impressive 40us, but a lockup usually happens within 20
>minutes.  Using a switch seems to make the hang less likely.  So I
>wondered about a locking error triggered by tx complete interrupts
>arriving in the middle of processing the next received packet.
>
>I suggested using LOCKDEBUG (and DIAGNOSTIC and DEBUG).  That runs ok
>until it hangs :-) Can one enter DDB if the big kernel lock is taken and
>not released?
>
>The machines were updated to the latest Dell BIOS; apparently there's a
>Dell advisory about a Xeon firmware bug that results in Windows
>bluescreens.
>
>Other than the lockup the machines are acting fine.  So I wonder if the
>machines are buggy, or if there's a locking bug.
>
>I have not seen any postings about trouble with this kind of lockup in
>NetBSD, and there are some posts of trouble with Linux on these Dell
>machines.
>
>If someone has two beefy machines with bnx or wm and has a few minutes
>to connect them with a cable, I'd be very curious to see what happens
>after ping -f for several hours.
>
>Has anyone else had similar trouble?  Any clues of what to try?
>

Boot Linux on both and try the same test.

christos
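
For a sense of scale when repeating the test on either OS, the figures in
the quoted message (40us RTT, a hang usually within 20 minutes) can be turned
into a rough packet count. This is only a back-of-the-envelope sketch. It
assumes a single ping -f instance that sends the next echo request as soon as
the previous reply arrives (with the documented floor of 100 packets per
second), so 1/RTT is an upper bound and the real rate may be lower. The
function and constant names are made up for illustration.

```python
# Rough scale estimate for the flood-ping test described in the quoted message.
# Assumption: ping -f is paced by replies, so the send rate is at most 1/RTT,
# and never drops below the 100 packets/s floor documented for flood mode.

RTT_SECONDS = 40e-6              # "RTT is an impressive 40us" (from the post)
TIME_TO_HANG_SECONDS = 20 * 60   # "usually happens within 20 minutes" (from the post)
FLOOD_MIN_RATE = 100             # packets per second floor for ping -f


def flood_rate(rtt_seconds: float) -> float:
    """Approximate upper bound on echo requests per second for one ping -f."""
    return max(FLOOD_MIN_RATE, 1.0 / rtt_seconds)


def round_trips(rtt_seconds: float, duration_seconds: float) -> float:
    """Approximate number of request/reply round trips in the given time."""
    return flood_rate(rtt_seconds) * duration_seconds


rate = flood_rate(RTT_SECONDS)
total = round_trips(RTT_SECONDS, TIME_TO_HANG_SECONDS)
print(f"up to ~{rate:,.0f} round trips/s, ~{total:,.0f} before a typical hang")
```

Under those assumptions the hang shows up only after something on the order
of tens of millions of round trips. A failure that rare per packet would fit
the narrow timing window suggested in the quoted message, though it does not
rule out a hardware or firmware fault.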


