Some colleagues at BBN have several Dell R610s, purchased fairly recently. They've been experiencing total hangs, from which they can recover only with the power button (held for 4s). Normally ctrl-alt-esc works to get into DDB, but after the hang ctrl-alt-esc does nothing.

The Dell boxes are pretty normal, with single SATA disks, 4 on-board bnx and a 4-port wm. The lockup happens with a netbsd-5 (RC3 I think) amd64 install CD, after doing an install and running from disk. They haven't tried i386.

It's not exactly clear what triggers the hang, but it seems to be network traffic, with ping (sourcing and sinking) being worse than forwarding. A fairly reliable way to hose the machines is to hook up a cat5 cable between two of them, ifconfig some addresses, and ping -f across that. RTT is an impressive 40us, but a lockup usually happens within 20 minutes. Using a switch seems to make the hang less likely.

So I wondered about a locking error triggered by tx complete interrupts arriving in the middle of processing the next received packet. I suggested using LOCKDEBUG (and DIAGNOSTIC and DEBUG). That runs ok until it hangs :-) Can one enter DDB if the big kernel lock is taken and not released?

The machines were updated to the latest Dell BIOS; apparently there's a Dell advisory about a Xeon firmware bug that results in Windows bluescreens. Other than the lockup the machines are acting fine.

So I wonder if the machines are buggy, or if there's a locking bug. I have not seen any postings about trouble with this kind of lockup in NetBSD, but there are some posts about trouble with Linux on these Dell machines.

If someone has two beefy machines with bnx or wm and has a few minutes to connect them with a cable, I'd be very curious to see what happens after ping -f for several hours.

Has anyone else had similar trouble? Any clues of what to try?
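For anyone willing to try the back-to-back test, the setup described above could be sketched roughly as below. This is only an illustration: the interface name (wm0), the 10.0.0.x addresses and the netmask are placeholders I've assumed, not the values used on the affected machines, and the DRY_RUN wrapper is a hypothetical convenience so the script only prints the commands unless told otherwise.

```shell
#!/bin/sh
# Sketch of the two-machine flood-ping test. All values are assumptions;
# adjust IFACE/LOCAL/PEER for each box (swap LOCAL and PEER on the other one).
IFACE=${IFACE:-wm0}        # assumed interface name (could be a bnx port)
LOCAL=${LOCAL:-10.0.0.1}   # assumed address for this machine
PEER=${PEER:-10.0.0.2}     # assumed address for the machine on the other end
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, 0 = really run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Configure the directly cabled interface, then flood-ping the peer.
repro() {
    run ifconfig "$IFACE" inet "$LOCAL" netmask 255.255.255.0 up
    run ping -f "$PEER"
}

repro
```

Running it with DRY_RUN=0 as root performs the actual test; ping -f needs root, and it should be left running for at least the 20 minutes it usually takes for the lockup to show up.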