Subject: port-i386/36432: port-i386-specific ioapic interrupt issues in 4.0_BETA2
To: None <port-i386-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: None <wileyc@rezrov.net>
List: netbsd-bugs
Date: 06/03/2007 02:05:00
>Number:         36432
>Category:       port-i386
>Synopsis:       ioapic stops routing interrupts after a while
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-i386-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jun 03 02:05:00 +0000 2007
>Originator:     Christopher SEKIYA
>Release:        NetBSD 4.0_BETA2
>Organization:

>Environment:
	
	
System: NetBSD guncho.rezrov.net 4.0_BETA2 NetBSD 4.0_BETA2 (NFS-NOAPIC) #0: Sun Jun 3 09:43:24 JST 2007 root@inasa.rezrov.net:/usr/local/netbsd/4src/sys/arch/i386/compile/NFS-NOAPIC i386
Architecture: i386
Machine: i386

dmesg from working kernel (NFS-NOAPIC, built without ioapic support):

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 4.0_BETA2 (NFS-NOAPIC) #0: Sun Jun  3 09:43:24 JST 2007
        root@inasa.rezrov.net:/usr/local/netbsd/4src/sys/arch/i386/compile/NFS-NOAPIC
total memory = 957 MB
avail memory = 937 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
BIOS32 rev. 0 found at 0xfa2a0
mainbus0 (root)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: VIA C3 Nehemiah (686-class), 997.22 MHz, id 0x69a
cpu0: features 381ba3f<FPU,VME,DE,PSE,TSC,MSR,APIC,SEP,MTRR>
cpu0: features 381ba3f<PGE,CMOV,PAT,MMX>
cpu0: features 381ba3f<FXSR,SSE>
cpu0: "VIA Nehemiah"
cpu0: I-cache 64 KB 32B/line 2-way, D-cache 64 KB 32B/line 2-way
cpu0: L2 cache 64 KB 32B/line 8-way
cpu0: ITLB 128 4 KB entries 8-way
cpu0: DTLB 128 4 KB entries 8-way
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: not started
ioapic at mainbus0: not configured
acpi0 at mainbus0: Advanced Configuration and Power Interface
acpi0: using Intel ACPI CA subsystem version 20060217
acpi0: X/RSDT: OemId <CN400 ,AWRDACPI,42302e31>, AslId <AWRD,00000000>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
ACPI-Fast 24-bit timer
LNKA: ACPI: Found matching pin for 0.8.INTA at func 0: 11
LNKB: ACPI: Found matching pin for 0.9.INTA at func 0: 12
LNKC: ACPI: Found matching pin for 0.10.INTA at func 0: 10
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
acpibut0 at acpi0 (PNP0C0C): ACPI Power Button
PNP0C01 [System Board] at acpi0 not configured
PNP0A03 [PCI/PCI-X Host Bridge] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not configured
PNP0000 [AT Interrupt Controller] at acpi0 not configured
PNP0200 [AT DMA Controller] at acpi0 not configured
attimer0 at acpi0 (PNP0100): AT Timer
attimer0: io 0x40-0x43 irq 0
PNP0B00 [AT Real-Time Clock] at acpi0 not configured
PNP0800 [AT-style speaker sound] at acpi0 not configured
PNP0C04 [Math Coprocessor] at acpi0 not configured
PNP0501 [16550A-compatible COM port] at acpi0 not configured
PNP0C0B [ACPI Fan] at acpi0 not configured
acpitz0 at acpi0: ACPI Thermal Zone
acpitz0: unable to get polling interval; using default of 30.0s
acpitz0: active cooling level 0: 100.0C
acpitz0: critical 100.0C passive 53.0C
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: VIA Technologies product 0x0259 (rev. 0x00)
agp at pchb0 not configured
pchb1 at pci0 dev 0 function 1
pchb1: VIA Technologies product 0x1259 (rev. 0x00)
pchb2 at pci0 dev 0 function 2
pchb2: VIA Technologies product 0x2259 (rev. 0x00)
pchb3 at pci0 dev 0 function 3
pchb3: VIA Technologies product 0x3259 (rev. 0x00)
pchb4 at pci0 dev 0 function 4
pchb4: VIA Technologies product 0x4259 (rev. 0x00)
pchb5 at pci0 dev 0 function 7
pchb5: VIA Technologies product 0x7259 (rev. 0x00)
ppb0 at pci0 dev 1 function 0: VIA Technologies VT8377CE CPU-AGP Bridge (rev. 0x00)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
VIA Technologies product 0x3118 (VGA display, revision 0x02) at pci1 dev 0 function 0 not configured
3ware 9000 series: (rev. 0x00)
twa0 at pci0 dev 8 function 0: 3ware Apache
twa0: interrupting at irq 11
twa0: AEN 0x0053: INFO: Need to do a capacity test: 
twa0: 12 ports, Firmware FE9X 2.08.00.009, BIOS BE9X 2.03.01.052
twa0: Monitor BL9X 2.02.00.001, PCB Rev 019 , Achip 3.20    , Pchip 1.50    
twa0: port 0: SAMSUNG SP2504C                          238475 MB
twa0: port 1: WDC WD2500JS-55NCB1                      238475 MB
twa0: port 2: Hitachi HDT725025VLA380                  238475 MB
twa0: port 3: ST3250620AS                              238475 MB
twa0: port 4: SAMSUNG SP2504C                          238475 MB
twa0: port 5: Maxtor 7L300S0                           286188 MB
twa0: AMCC    9500S-12   DISK 2.08T7D6VKGJE5477800294E
ld0 at twa0 unit 0
ld0: 232 GB, 30392 cyl, 255 head, 63 sec, 512 bytes/sect x 488259584 sectors
twa0: AMCC    9500S-12   DISK 2.08R10EET1KE575E900B019
ld1 at twa0 unit 1
ld1: 232 GB, 30392 cyl, 255 head, 63 sec, 512 bytes/sect x 488259584 sectors
twa0: AMCC    9500S-12   DISK 2.08L6073Z7G593655002213
ld2 at twa0 unit 2
ld2: 232 GB, 30392 cyl, 255 head, 63 sec, 512 bytes/sect x 488259584 sectors
fxp0 at pci0 dev 9 function 0: i82550 Ethernet, rev 16
fxp0: interrupting at irq 12
fxp0: Ethernet address 00:e0:81:55:cf:d4
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
vge0 at pci0 dev 10 function 0: VIA VT612X Gigabit Ethernet (rev. 0x11)
vge0: interrupting at irq 10
vge0: Ethernet address: 00:e0:81:55:cf:d2
ciphy0 at vge0 phy 1: Cicada CS8201 10/100/1000TX PHY, rev. 2
ciphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
viapcib0 at pci0 dev 17 function 0
viapcib0: VIA Technologies VT8237 (Apollo KT600) PCI-ISA Bridge (rev. 0x00)
viapcib0: SMBus found at 0x500 (revision 0x0)
iic0 at viapcib0: I2C bus
isa0 at viapcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
npx0 at isa0 port 0xf0-0xff
npx0: reported by CPUID; using exception 16
timecounter: Timecounter "TSC" frequency 997234050 Hz quality 800
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
boot device: ld0
root on ld0a dumps on ld0b
root file system type: ffs

dmesg from broken kernel (NFS: GENERIC.MP with extraneous drivers and options removed for debugging purposes):

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006
The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California.  All rights reserved.

NetBSD 4.0_BETA2 (NFS) #0: Sat Jun  2 19:07:04 JST 2007
wileyc@inasa.rezrov.net:/usr/local/netbsd/4src/sys/arch/i386/compile/NFS
total memory = 957 MB
avail memory = 937 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
BIOS32 rev. 0 found at 0xfa2a0
mainbus0 (root)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: VIA C3 Nehemiah (686-class), 997.22 MHz, id 0x69a
cpu0: features 381ba3f<FPU,VME,DE,PSE,TSC,MSR,APIC,SEP,MTRR>
cpu0: features 381ba3f<PGE,CMOV,PAT,MMX>
cpu0: features 381ba3f<FXSR,SSE>
cpu0: "VIA Nehemiah"
cpu0: I-cache 64 KB 32B/line 2-way, D-cache 64 KB 32B/line 2-way
cpu0: L2 cache 64 KB 32B/line 8-way
cpu0: ITLB 128 4 KB entries 8-way
cpu0: DTLB 128 4 KB entries 8-way
cpu0: calibrating local timer
cpu0: apic clock running at 132 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: starting
cpu1: VIA C3 Nehemiah (686-class), 997.16 MHz, id 0x69a
cpu1: features 381ba3f<FPU,VME,DE,PSE,TSC,MSR,APIC,SEP,MTRR>
cpu1: features 381ba3f<PGE,CMOV,PAT,MMX>
cpu1: features 381ba3f<FXSR,SSE>
cpu1: "VIA Nehemiah"
cpu1: I-cache 64 KB 32B/line 2-way, D-cache 64 KB 32B/line 2-way
cpu1: L2 cache 64 KB 32B/line 8-way
cpu1: ITLB 128 4 KB entries 8-way
cpu1: DTLB 128 4 KB entries 8-way
ioapic0 at mainbus0 apid 2 (I/O APIC)
ioapic0: pa 0xfec00000, version 3, 24 pins
ioapic0: misconfigured as apic 0
ioapic0: remapped to apic 2
acpi0 at mainbus0: Advanced Configuration and Power Interface
acpi0: using Intel ACPI CA subsystem version 20060217
acpi0: X/RSDT: OemId <CN400 ,AWRDACPI,42302e31>, AslId <AWRD,00000000>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
ACPI-Fast 24-bit timer
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
acpibut0 at acpi0 (PNP0C0C): ACPI Power Button
PNP0C01 [System Board] at acpi0 not configured
PNP0A03 [PCI/PCI-X Host Bridge] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not configured
PNP0000 [AT Interrupt Controller] at acpi0 not configured
PNP0200 [AT DMA Controller] at acpi0 not configured
attimer0 at acpi0 (PNP0100): AT Timer
attimer0: io 0x40-0x43 irq 0
PNP0B00 [AT Real-Time Clock] at acpi0 not configured
PNP0800 [AT-style speaker sound] at acpi0 not configured
PNP0C04 [Math Coprocessor] at acpi0 not configured
PNP0501 [16550A-compatible COM port] at acpi0 not configured
PNP0C0B [ACPI Fan] at acpi0 not configured
acpitz0 at acpi0: ACPI Thermal Zone
acpitz0: unable to get polling interval; using default of 30.0s
acpitz0: active cooling level 0: 100.0C
acpitz0: critical 100.0C passive 54.0C
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: VIA Technologies product 0x0259 (rev. 0x00)
agp at pchb0 not configured
pchb1 at pci0 dev 0 function 1
pchb1: VIA Technologies product 0x1259 (rev. 0x00)
pchb2 at pci0 dev 0 function 2
pchb2: VIA Technologies product 0x2259 (rev. 0x00)
pchb3 at pci0 dev 0 function 3
pchb3: VIA Technologies product 0x3259 (rev. 0x00)
pchb4 at pci0 dev 0 function 4
pchb4: VIA Technologies product 0x4259 (rev. 0x00)
pchb5 at pci0 dev 0 function 7
pchb5: VIA Technologies product 0x7259 (rev. 0x00)
ppb0 at pci0 dev 1 function 0: VIA Technologies VT8377CE CPU-AGP Bridge (rev. 0x00)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
VIA Technologies product 0x3118 (VGA display, revision 0x02) at pci1 dev 0 function 0 not configured
3ware 9000 series: (rev. 0x00)
twa0 at pci0 dev 8 function 0: 3ware Apache
twa0: interrupting at ioapic0 pin 16 (irq 11)
twa0: AEN 0x0053: INFO: Need to do a capacity test: 
twa0: 12 ports, Firmware FE9X 2.08.00.009, BIOS BE9X 2.03.01.052
twa0: Monitor BL9X 2.02.00.001, PCB Rev 019 , Achip 3.20    , Pchip 1.50    
twa0: port 0: SAMSUNG SP2504C                          238475 MB
twa0: port 1: WDC WD2500JS-55NCB1                      238475 MB
twa0: port 2: Hitachi HDT725025VLA380                  238475 MB
twa0: port 3: ST3250620AS                              238475 MB
twa0: port 4: SAMSUNG SP2504C                          238475 MB
twa0: port 5: Maxtor 7L300S0                           286188 MB
twa0: AMCC    9500S-12   DISK 2.08T7D6VKGJE5477800294E
ld0 at twa0 unit 0
ld0: 232 GB, 30392 cyl, 255 head, 63 sec, 512 bytes/sect x 488259584 sectors
twa0: AMCC    9500S-12   DISK 2.08R10EET1KE575E900B019
ld1 at twa0 unit 1
ld1: 232 GB, 30392 cyl, 255 head, 63 sec, 512 bytes/sect x 488259584 sectors
twa0: AMCC    9500S-12   DISK 2.08L6073Z7G593655002213
ld2 at twa0 unit 2
ld2: 232 GB, 30392 cyl, 255 head, 63 sec, 512 bytes/sect x 488259584 sectors
fxp0 at pci0 dev 9 function 0: i82550 Ethernet, rev 16
fxp0: interrupting at ioapic0 pin 17 (irq 12)
fxp0: Ethernet address 00:e0:81:55:cf:d4
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
vge0 at pci0 dev 10 function 0: VIA VT612X Gigabit Ethernet (rev. 0x11)
vge0: interrupting at ioapic0 pin 18 (irq 10)
vge0: Ethernet address: 00:e0:81:55:cf:d2
ciphy0 at vge0 phy 1: Cicada CS8201 10/100/1000TX PHY, rev. 2
ciphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
viapcib0 at pci0 dev 17 function 0
viapcib0: VIA Technologies VT8237 (Apollo KT600) PCI-ISA Bridge (rev. 0x00)
viapcib0: SMBus found at 0x500 (revision 0x0)
iic0 at viapcib0: I2C bus
isa0 at viapcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
npx0 at isa0 port 0xf0-0xff
npx0: reported by CPUID; using exception 16
ioapic0: enabling
timecounter: Timecounter "TSC" frequency 997230200 Hz quality 800
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
boot device: ld0
root on ld0a dumps on ld0b
root file system type: ffs
cpu1: CPU 1 running

>Description:

I've been seeing frustrating interrupt loss under 4.0_BETA2 on one of my
machines, and I think I've narrowed it down to a problem with the handling
of interrupts routed through the ioapic.

This machine was originally an AMD64 system (MSI-K8MM3 motherboard) with
add-in twa and re cards, running the amd64 port with a GENERIC.MP kernel.
Although this configuration was quite stable, I needed to run the FreeBSD
3ware management tools, so I migrated it to the i386 port, again with a
GENERIC.MP kernel.

Soon afterwards, the machine began hanging during heavy disk I/O.  I replaced
the motherboard with a VIA VT-310DP (dual CPU, three onboard NICs: vge, vr,
fxp).  Not only did the disk hangs continue, but I also started seeing
intermittent network watchdog timeouts on all three interfaces.

Dropping into ddb worked, but bt showed nothing beyond the serial console
interrupt handler.  Attempting to reboot or sync spun forever.

Removing ioapic support from the kernel (the NFS-NOAPIC kernel above) made
the machine stable again.
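One way to check whether interrupts on a particular ioapic pin have stopped arriving is to take two snapshots of the interrupt counters while the load is running and compare them.  The following is a minimal sh sketch built on stated assumptions, not something used to produce this report.  It assumes `vmstat -i` prints one line per interrupt source, with the source name first (e.g. "ioapic0 pin 16") and the total and rate as the last two columns.  The function names are hypothetical.  An idle source will also show an unchanged count, so the result only means something for sources that should be busy.

```sh
# HYPOTHETICAL helpers; ASSUMES `vmstat -i` lines look like
#   <source name ...> <total> <rate>
# e.g. "ioapic0 pin 16    45678    3".

# Read vmstat -i output on stdin; print "<source><TAB><total>" for each
# source whose name starts with an ioapic device (ioapic0, ioapic1, ...).
ioapic_counts() {
    awk '$1 ~ /^ioapic[0-9]+$/ && NF >= 4 {
        name = $1
        for (i = 2; i <= NF - 2; i++)
            name = name " " $i
        print name "\t" $(NF - 1)
    }'
}

# Compare two files produced by ioapic_counts and report every source
# whose total did not change between the first and second snapshot.
stalled_sources() {
    awk 'BEGIN { FS = "\t" }
        FILENAME == ARGV[1] { before[$1] = $2; next }
        ($1 in before) && before[$1] == $2 { print $1 " unchanged at " $2 }' \
        "$1" "$2"
}

# Take two snapshots INTERVAL seconds apart (default 10) and report the
# ioapic sources that received no interrupts in between.
watch_ioapic() {
    interval=${1:-10}
    snap1=$(mktemp /tmp/ioapic.XXXXXX) || return 1
    snap2=$(mktemp /tmp/ioapic.XXXXXX) || return 1
    vmstat -i | ioapic_counts > "$snap1"
    sleep "$interval"
    vmstat -i | ioapic_counts > "$snap2"
    stalled_sources "$snap1" "$snap2"
    rm -f "$snap1" "$snap2"
}
```

For example, run `watch_ioapic 30` on the console during the tar load.  It should normally report nothing for twa0's pin, which is pin 16 in the broken-kernel dmesg.  If that pin is reported as unchanged, its interrupts are no longer being delivered or counted.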

>How-To-Repeat:
Install the port-i386 GENERIC.MP kernel on a single- or dual-processor
machine, preferably one with a twa card (mine has a BBU and six drives
configured as three mirrors).

Subject the card to a moderate write load while watching the console;
"tar czf - /usr/src/sys | ssh ${TARGET} tar xzvf - -C /" killed it every
time for me.  The machine will hang at some point during the write, and
subsequent attempts to access the filesystems will hang silently.  The
console still responds to a break.

The network interfaces can be knocked out in the same way (they emit
"watchdog timeouts" and never recover), but I do not have a reliable test
case for this.
	
>Fix:
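I commented out ioapic in the kernel configuration.  Without it the second
processor is never started (see "cpu1: not started" in the working dmesg),
which I can live with on a production fileserver, but that's probably
suboptimal.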
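The workaround amounts to building a kernel whose configuration has the ioapic attachment commented out.  The working NFS-NOAPIC dmesg reflects this with "ioapic at mainbus0: not configured".  The sketch below is minimal and assumes the attachment appears as a line beginning with `ioapic` (such as `ioapic* at mainbus?`) directly in the config file rather than in an included one.  The `disable_ioapic` helper is hypothetical.

```sh
# HYPOTHETICAL helper: copy kernel config SRC_CONF to DST_CONF with every
# line that starts with "ioapic" (e.g. "ioapic* at mainbus?") commented out.
# ASSUMES the attachment is in SRC_CONF itself, not in an included file.
disable_ioapic() {
    src_conf=$1
    dst_conf=$2
    sed 's/^ioapic/#ioapic/' "$src_conf" > "$dst_conf"
}
```

For example, in sys/arch/i386/conf, run `disable_ioapic NFS NFS-NOAPIC`.  Then follow the usual `config NFS-NOAPIC`, `cd ../compile/NFS-NOAPIC` and `make depend && make` sequence.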
	

>Unformatted: