Subject: Strange network hang on Poweredge 860
To: None <netbsd-help@netbsd.org>
From: Lars Friend <lfriend@mcci.com>
List: netbsd-help
Date: 09/10/2007 14:34:02
Hello all,
         I've been experiencing a very strange mode of failure which has me
scratching my head so I figured I'd ask here to see if anybody had seen
something like this before.

         I have installed NetBSD 3.1 on a brand new Dell PowerEdge 860
system (dual core P4 Xeon, 4GB ram, 2 SATA drives in software RAID using
raidframe raid1).

         This system is in line (once stable) to replace an aging, slow box
and take over POP, SMTP, DHCP, and secure login services for a decent
sized pool of users.  I cloned the old system from backups (using restore),
put the GENERIC.MP kernel in place, and changed its hostname and IP.
I also turned off dhcpd (so as not to stomp on the live server) and let it
run for a few weeks (logging in and using it from time to time, testing out
patches, and doing general system stuff).  It was rock solid and very stable.

         So we replaced the old system with our fancy new one, and four hours
into operation things got weird.  The system was still running, everything
seemed okay, and there was nothing unexpected or unpleasant in syslog, but
the NIC was kaput.  It saw link and seemed to be okay, but it would not
accept or make connections, answer pings, or pass any other network traffic.

         On speculation, we tried again with the non-MP kernel (just the i386
GENERIC), and it did it again, four hours into operation.  We added another
NIC (a RealTek NIC, re0) and tried again using re0 as our primary NIC,
figuring different card, different driver, maybe it'll work.  Nope.  Not only
did it hang, but after the network hung I tried to bring bge0 up to see if
_it_ could talk, and it seemed to be stuck too.  (It's worth noting that
they share an IRQ.  Not sure if this has anything to do with it.)

         So we put the old system back up and pulled the new one (the 860)
back into testing, but so far I have not been able to duplicate the failure.
My first shot was to run stress and keep the system busy, but it passed that
test with flying colors.
         Last night I ran it all night answering pointless login sessions.  I
made a script to SSH in and execute a bunch of representative user activities
on several test user accounts (stuff like reading the mail spool, sleeping,
copying files, grepping logs, forwarding ports via SSH, etc.) and let it run
under about the same load, number of users, etc. as our crash condition, and
it still has not crashed.
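
         A minimal sketch of that kind of session driver, for illustration
only and not the actual script.  The host name, test accounts, and remote
commands are all assumptions, and it runs sessions one after another rather
than concurrently:

```python
import random
import subprocess

# Hypothetical remote commands standing in for "representative user
# activities"; each is passed to the remote shell as one string.
ACTIVITIES = [
    "wc -l /var/mail/$USER",                                   # read the mail spool
    "sleep 30",                                                # idle session
    "cp /etc/services /tmp/services.$$ && rm /tmp/services.$$",  # copy a file
    "grep -c sshd /var/log/messages",                          # grep a log
]


def build_ssh_command(user, host, remote_cmd):
    """Return the argv list for one non-interactive SSH session."""
    return [
        "ssh",
        "-o", "BatchMode=yes",        # never prompt; fail if keys are missing
        "-o", "ConnectTimeout=10",    # give up quickly if the host is wedged
        f"{user}@{host}",
        remote_cmd,
    ]


def run_sessions(host, users, rounds, seed=None):
    """Run `rounds` randomly chosen activities; return how many failed."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(rounds):
        cmd = build_ssh_command(rng.choice(users), host, rng.choice(ACTIVITIES))
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=120)
            ok = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # A timeout or a failure to launch ssh counts as a failed session.
            ok = False
        if not ok:
            failures += 1
    return failures
```

Calling something like run_sessions("newbox", ["test1", "test2"], rounds=100)
(names made up) would report how many sessions failed, which gives a crude
signal for when the network stops answering.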

         There are a couple of things I am not simulating at the moment:
dhcpd, sendmail, and NFS mounting of home directories with amd.  Aside
from that, it is pretty darn close to the real running configuration.

         Has anybody seen this before, or does anybody have a good hunch about
what I can do to duplicate the failure?  Once I can duplicate it "in
captivity" it will be easier to debug and easier to correct, but I would love
to be able to duplicate it without putting it up live and letting it crash,
because that is not only a lot of work but also inconveniences users who need
to use the system.

         Thanks for any insights, I'm tearing my hair out =:-/

                 -Lars Friend

PS:

I have included the output of dmesg in case that sheds any light:

NetBSD 3.1 (GENERIC) #0: Tue Oct 31 04:27:07 UTC 2006
         builds@b0.netbsd.org:/home/builds/ab/netbsd-3-1-RELEASE/i386/200610302053Z-obj/home/builds/ab/netbsd-3-1-RELEASE/src/sys/arch/i386/compile/GENERIC
total memory = 3583 MB
avail memory = 3498 MB
BIOS32 rev. 0 found at 0xffe90
mainbus0 (root)
cpu0 at mainbus0: (uniprocessor)
cpu0: Intel Pentium Pro, II or III (686-class), 2400.18 MHz, id 0x6f6
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 e3bd<SSE3,MONITOR,DS-CPL,VMX,EST,TM2,xTPR>
cpu0: "Intel(R) Xeon(R) CPU            3060  @ 2.40GHz"
cpu0: I-cache 32 KB 64B/line 8-way, D-cache 32 KB 64B/line 8-way
cpu0: running without thermal monitor!
cpu0: Enhanced SpeedStep disabled by BIOS
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: Intel product 0x2778 (rev. 0x00)
ppb0 at pci0 dev 1 function 0: Intel product 0x2779 (rev. 0x00)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
ppb1 at pci0 dev 28 function 0: Intel 82801GB/GR PCI Express Port #1 (rev. 0x01)
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
ppb2 at pci2 dev 0 function 0: Intel product 0x032c (rev. 0x09)
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled, rd/line, wr/inv ok
re0 at pci3 dev 2 function 0: RealTek 8169S Single-chip Gigabit Ethernet
re0: interrupting at irq 5
re0: Ethernet address 00:14:6c:cb:68:dc
re0: using 256 tx descriptors
ukphy0 at re0 phy 7: Generic IEEE 802.3u media interface
ukphy0: RTL8169S/8110S 1000BASE-T media interface (OUI 0x00e04c, model 0x0011), rev. 0
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
ppb3 at pci0 dev 28 function 4: Intel 82801GB/GR PCI Express Port #5 (rev. 0x01)
pci4 at ppb3 bus 4
pci4: i/o space, memory space enabled, rd/line, wr/inv ok
bge0 at pci4 dev 0 function 0: Broadcom BCM5721 Gigabit Ethernet
bge0: interrupting at irq 3
bge0: PCI-Express DMA setting 0x76180000, expected 0x76180000
bge0: ASIC BCM5751 A1 (0x4101), Ethernet address 00:19:b9:f7:47:a2
bge0: setting short Tx thresholds
brgphy0 at bge0 phy 1: BCM5750 1000BASE-T media interface, rev. 0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
ppb4 at pci0 dev 28 function 5: Intel 82801GB/GR PCI Express Port #6 (rev. 0x01)
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled, rd/line, wr/inv ok
bge1 at pci5 dev 0 function 0: Broadcom BCM5721 Gigabit Ethernet
bge1: interrupting at irq 11
bge1: PCI-Express DMA setting 0x76180000, expected 0x76180000
bge1: ASIC BCM5751 A1 (0x4101), Ethernet address 00:19:b9:f7:47:a3
bge1: setting short Tx thresholds
brgphy1 at bge1 phy 1: BCM5750 1000BASE-T media interface, rev. 0
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
uhci0 at pci0 dev 29 function 0: Intel 82801GB/GR USB UHCI Controller (rev. 0x01)
uhci0: interrupting at irq 11
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 29 function 1: Intel 82801GB/GR USB UHCI Controller (rev. 0x01)
uhci1: interrupting at irq 10
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2 at pci0 dev 29 function 2: Intel 82801GB/GR USB UHCI Controller (rev. 0x01)
uhci2: interrupting at irq 6
usb2 at uhci2: USB revision 1.0
uhub2 at usb2
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0 at pci0 dev 29 function 7: Intel 82801GB/GR USB EHCI Controller (rev. 0x01)
ehci0: interrupting at irq 11
ehci0: BIOS has given up ownership
ehci0: EHCI version 1.0
ehci0: wrong number of companions (7 != 3)
ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2
usb3 at ehci0: USB revision 2.0
uhub3 at usb3
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: single transaction translator
uhub3: 6 ports with 6 removable, self powered
ppb5 at pci0 dev 30 function 0: Intel 82801BA Hub-PCI Bridge (rev. 0xe1)
pci6 at ppb5 bus 6
pci6: i/o space, memory space enabled
vga1 at pci6 dev 5 function 0: ATI Technologies product 0x515e (rev. 0x02)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
pcib0 at pci0 dev 31 function 0
pcib0: Intel 82801GB/GR LPC Interface Bridge (rev. 0x01)
piixide0 at pci0 dev 31 function 1
piixide0: Intel 82801GB/GR IDE Controller (ICH7) (rev. 0x01)
piixide0: bus-master DMA support present
piixide0: primary channel configured to compatibility mode
piixide0: primary channel interrupting at irq 14
atabus0 at piixide0 channel 0
piixide0: secondary channel configured to compatibility mode
piixide0: secondary channel ignored (disabled)
piixide1 at pci0 dev 31 function 2
piixide1: Intel 82801GB/GR Serial ATA/Raid Controller (ICH7) (rev. 0x01)
piixide1: bus-master DMA support present
piixide1: primary channel configured to native-PCI mode
piixide1: using irq 11 for native-PCI interrupt
atabus1 at piixide1 channel 0
piixide1: secondary channel configured to native-PCI mode
atabus2 at piixide1 channel 1
Intel 82801GB/GR SMBus Controller (SMBus serial bus, revision 0x01) at pci0 dev 31 function 3 not configured
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
isapnp0: no ISA Plug 'n Play devices found
Kernelized RAIDframe activated
atapibus0 at atabus0: 2 targets
cd0 at atapibus0 drive 0: <HL-DT-STCD-RW/DVD-ROM GCC-4244N, , B101> cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
cd0(piixide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA)
uhub4 at uhub3 port 3
uhub4: Cypress Semiconductor USB2 Hub, class 9/0, rev 2.00/0.0b, addr 2
uhub4: multiple transaction translators
uhub4: 4 ports with 4 removable, self powered
wd0 at atabus1 drive 0: <ST3500630NS>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(piixide1:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd1 at atabus2 drive 0: <ST3500630NS>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(piixide1:1:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a
raid0: Total Sectors: 976772992 (476939 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
uhidev0 at uhub2 port 1 configuration 1 interface 0
uhidev0: Avocent Avocent USBIAC, rev 1.10/1.00, addr 2, iclass 3/1
ukbd0 at uhidev0
wskbd1 at ukbd0 mux 1
wskbd1: connecting to wsdisplay0
uhidev1 at uhub2 port 1 configuration 1 interface 1
uhidev1: Avocent Avocent USBIAC, rev 1.10/1.00, addr 2, iclass 3/1
uhidev1: 3 report ids
ums0 at uhidev1 reportid 1: 5 buttons and Z dir.
wsmouse0 at ums0 mux 0
uhid0 at uhidev1 reportid 2: input=2, output=0, feature=0
uhid1 at uhidev1 reportid 3: input=1, output=0, feature=0
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)