NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK



Hi again,

I've already tried LOCK_DEBUG - no change in behaviour and I also don't see additional usable information. (But now I'm nearly shure, it is not the KERNEL_LOCK - see below ...)

I've still not added the requested curcpu()->ci_biglock_count print, but I've searched for the irq 6 or something like this.
The "surprising" result:

s011a# grep irq /var/run/dmesg.boot
pckbc0 at acpi0 (KBC0, PNP0303) (kbd port): io 0x60,0x64 irq 1
pckbc1 at acpi0 (MSE0, PNP0F13) (aux port): irq 12
com0 at acpi0 (COM1, PNP0501-1): io 0x3f8-0x3ff irq 4
com1 at acpi0 (COM2, PNP0501-2): io 0x2f8-0x2ff irq 3
pckbc0: using irq 1 for kbd slot
s011a# grep interrupt !$
grep interrupt /var/run/dmesg.boot
acpi0: SCI interrupting at int 9
wm0: interrupting at ioapic1 pin 0
wm1: interrupting at ioapic1 pin 1
Intel product 0x0326 (interrupt system, interface 0x20, revision 0x09) at pci1 dev 0 function 1 not configured
uhci0: interrupting at ioapic0 pin 16
uhci1: interrupting at ioapic0 pin 17
uhci2: interrupting at ioapic0 pin 18
ehci0: interrupting at ioapic0 pin 18
wm2: interrupting at ioapic0 pin 16
wm3: interrupting at ioapic0 pin 17
wm4: interrupting at ioapic0 pin 18
wm5: interrupting at ioapic0 pin 19
uhci3: interrupting at ioapic0 pin 23
uhci4: interrupting at ioapic0 pin 22
uhci5: interrupting at ioapic0 pin 21
ehci1: interrupting at ioapic0 pin 23
pciide0: using ioapic0 pin 23 for native-PCI interrupt
piixide0: primary channel interrupting at ioapic0 pin 14
piixide0: secondary channel interrupting at ioapic0 pin 15
ichsmb0: interrupting at ioapic0 pin 17
piixide1: using ioapic0 pin 18 for native-PCI interrupt
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
s011a# grep ioapic !$
grep ioapic /var/run/dmesg.boot
ioapic0 at mainbus0 apid 2: pa 0xfec00000, version 20, 24 pins
ioapic1 at mainbus0 apid 3: pa 0xfecc0000, version 20, 24 pins
wm0: interrupting at ioapic1 pin 0
wm1: interrupting at ioapic1 pin 1
uhci0: interrupting at ioapic0 pin 16
uhci1: interrupting at ioapic0 pin 17
uhci2: interrupting at ioapic0 pin 18
ehci0: interrupting at ioapic0 pin 18
wm2: interrupting at ioapic0 pin 16
wm3: interrupting at ioapic0 pin 17
wm4: interrupting at ioapic0 pin 18
wm5: interrupting at ioapic0 pin 19
uhci3: interrupting at ioapic0 pin 23
uhci4: interrupting at ioapic0 pin 22
uhci5: interrupting at ioapic0 pin 21
ehci1: interrupting at ioapic0 pin 23
pciide0: using ioapic0 pin 23 for native-PCI interrupt
piixide0: primary channel interrupting at ioapic0 pin 14
piixide0: secondary channel interrupting at ioapic0 pin 15
ichsmb0: interrupting at ioapic0 pin 17
piixide1: using ioapic0 pin 18 for native-PCI interrupt




Hmmm - no irq6 or pin 6 !?!?!?!
The first greating line is in /var/run/dmesg.boot, so the result is "complete".
The board is reported as 'Supermicro X7SBi-LN4 (0123456789)'.
The two "additonal" interfaces (wm0 and wm1) are on ioapic1 - no other HW is on that APIC.



The search for not-configured devices say:
s011a# grep confi !$
grep confi /var/run/dmesg.boot
MBRD (PNP0C02) at acpi0 not configured
FWH (INT0800) at acpi0 not configured
pci0 at mainbus0 bus 0: configuration mode 1
Intel product 0x0326 (interrupt system, interface 0x20, revision 0x09) at pci1 dev 0 function 1 not configured
drm at vga0 not configured
ichlpcib0: TCO (watchdog) timer configured.
piixide0: primary channel configured to compatibility mode
piixide0: secondary channel configured to compatibility mode
Intel 82801I Thermal Controller (miscellaneous DASP, revision 0x02) at pci0 dev 31 function 6 not configured


Not really a lot - other systems have more unconfigured devices ....
Normaly I'm testing all available drivers on a "new" HW, so I'm nearly 100% shure that theese devices are not covered by any driver.

After that research it looks like there is something "suddenly" interrupting without a driver for it. The time it take till "freeze" varies, so a "normal timer timeout" does not sound reasonable. And why does this occure only after installing two additional interfaces (wm0 and wm1) that use the second APIC? The system does not show the problem without them.

pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0: Intel product 0x29f0 (rev. 0x01)
ppb0 at pci0 dev 1 function 0: Intel product 0x29f1 (rev. 0x01)
ppb0: PCI Express 2.0 <Root Port of PCI-E Root Complex>
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
ppb1 at pci1 dev 0 function 0: Intel product 0x032c (rev. 0x09)
ppb1: PCI Express 1.0 <PCI-E to PCI/PCI-X Bridge>
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
wm0 at pci2 dev 1 function 0: Intel i82546GB 1000BASE-T Ethernet (rev. 0x03)
wm0: interrupting at ioapic1 pin 0
wm0: 64-bit 133MHz PCIX bus
wm0: NVM dump:
1b00 0621 2a12 0430 ffff ffff ffff ffff
c414 2104 460b 1179 8086 1079 8086 34e8
000c 1079 0000 2102 10c8 ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
c30c ff63 5004 2102 00c8 ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff 0602
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff 6be8
wm0: 256 word (8 address bits) MicroWire EEPROM
wm0: Ethernet address 00:1b:21:06:12:2a
makphy0 at wm0 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
wm1 at pci2 dev 1 function 1: Intel i82546GB 1000BASE-T Ethernet (rev. 0x03)
wm1: interrupting at ioapic1 pin 1
wm1: 64-bit 133MHz PCIX bus
wm1: NVM dump:
1b00 0621 2a12 0430 ffff ffff ffff ffff
c414 2104 460b 1179 8086 1079 8086 34e8
000c 1079 0000 2102 10c8 ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff ffff
c30c ff63 5004 2102 00c8 ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff 0602
ffff ffff ffff ffff ffff ffff ffff ffff
ffff ffff ffff ffff ffff ffff ffff 6be8
wm1: 256 word (8 address bits) MicroWire EEPROM
wm1: Ethernet address 00:1b:21:06:12:2b
makphy1 at wm1 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
makphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto


Question: Are unused interrupts enabled? And if "yes", why?
I thought I've seen some code that will print a warning about this and tries to reset the interrupt, but I'm not 100% shure about the netbsd-version I've looked at.

Any hints to go on with debugging?
It would take me a lot of time to "understand" the interrupt setup on the amd64 architecture ...

best regards

W. Stukenbrock


Manuel Bouyer wrote:

The following reply was made to PR kern/48733; it has been noted by GNATS.

From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, 
netbsd-bugs%NetBSD.org@localhost,
        Wolfgang.Stukenbrock%nagler-company.com@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Date: Mon, 12 May 2014 14:23:46 +0200

 On Fri, May 09, 2014 at 12:20:00PM +0000, Wolfgang Stukenbrock wrote:
 >  [...]
 >  db{0}> bt/a fffffe822f73f420
 >  trace: pid 0 lid 3 at 0xfffffe810e967760
 >  ether_output() at netbsd:ether_output+0x2b6
 >  ip_output() at netbsd:ip_output+0xa8f
 >  tcp_output() at netbsd:tcp_output+0x1698
 >  tcp_input() at netbsd:tcp_input+0x15d9
 >  ip_input() at netbsd:ip_input+0x3ef
 >  ipintr() at netbsd:ipintr+0x109
 >  softint_dispatch() at netbsd:softint_dispatch+0xd9
 >  DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e967d70
 >  Xsoftintr() at netbsd:Xsoftintr+0x4f
 >  --- interrupt ---
 >  0:
> > That is the part that is gooing to send a packet. I see the printout in > ip_output prior calling 'ifp->if_output()' - not the one behind. > The location pointed to by the backtrace in ether_output() is the call > to "return ifq_enqueue(...)". I also see the printout I've added in > front of this call, but not the one behind. > In ifq_enqueue() I see the output of the call to 'ifp->if_start' - the > wm-driver - in this routine and the printout in front of the splx(s) at > the end of the routine - not the printout behind it. > This is the localtion where the deadlock happens while processing other > interrupts in Xspllower.
 >  This always looks the same ....
ether_output() is called with the KERNEL_LOCK held, so at this point cpu0
 already owns KERNEL_LOCK, it won't spin trying to grab it again.
 You can confirm this by printing curcpu()->ci_biglock_count.
 Did you try a kernel with options LOCKDEBUG ?
What's possible here is a loop trying to process the same interrupt
 forever.
> > > > db{0}> bt/a fffffe822f736440
 >  trace: pid 0 lid 6 at 0xfffffe810e9739c8
 >  breakpoint() at netbsd:breakpoint+0x5
 >  comintr() at netbsd:comintr+0x518
 >  Xintr_ioapic_edge1() at netbsd:Xintr_ioapic_edge1+0xea
 >  --- interrupt ---
 >  bus_space_read_4() at netbsd:bus_space_read_4+0xa
 >  intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x3b
 >  Xintr_ioapic_level6() at netbsd:Xintr_ioapic_level6+0xf2
 >  --- interrupt ---
 >  Xspllower() at netbsd:Xspllower+0xe
 >  DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e973d70
 >  Xsoftintr() at netbsd:Xsoftintr+0x4f
 >  --- interrupt ---
 >  0:
> > > Hmmm - not shure about it ... > It looks like that during processing one pending interrupt in Xspllower > at the end of that routine an interrupt came im that takes the > KERNEL_LOCK in intr_biglock_wrapper() again and do what? Hangup in > bus_space_read_4() ???? Busy-loop for whatever reason in that interrupt > and the location where the DDB-enter occures in bus_space_read_4() is > just random ???? > The comintr looks like the break-interrupt on the serial console of the > system to enter DDB to me. it is. > Any idea to find out what interrupt routine it is??? dmesg could point to the problem; the interrupt we're looking for is
 level-triggered on pin 6 (so maybe "irq 6")
-- Manuel Bouyer <bouyer%antioche.eu.org@localhost>
      NetBSD: 26 ans d'experience feront toujours la difference
 --


--


Dr. Nagler & Company GmbH
Hauptstraße 9
92253 Schnaittenbach

Tel. +49 9622/71 97-42
Fax +49 9622/71 97-50

Wolfgang.Stukenbrock%nagler-company.com@localhost
http://www.nagler-company.com


Hauptsitz: Schnaittenbach
Handelregister: Amberg HRB
Gerichtsstand: Amberg
Steuernummer: 201/118/51825
USt.-ID-Nummer: DE 273143997
Geschäftsführer: Dr. Martin Nagler




Home | Main Index | Thread Index | Old Index