NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK



The following reply was made to PR kern/48733; it has been noted by GNATS.

From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, 
netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Date: Fri, 16 May 2014 10:20:53 +0200

 Hi again,
 
 I've already tried LOCK_DEBUG - no change in behaviour and I also don't 
 see additional usable information. (But now I'm nearly shure, it is not 
 the KERNEL_LOCK - see below ...)
 
 I've still not added the requested curcpu()->ci_biglock_count print, but 
 I've searched for the irq 6 or something like this.
 The "surprising" result:
 
 s011a# grep irq /var/run/dmesg.boot
 pckbc0 at acpi0 (KBC0, PNP0303) (kbd port): io 0x60,0x64 irq 1
 pckbc1 at acpi0 (MSE0, PNP0F13) (aux port): irq 12
 com0 at acpi0 (COM1, PNP0501-1): io 0x3f8-0x3ff irq 4
 com1 at acpi0 (COM2, PNP0501-2): io 0x2f8-0x2ff irq 3
 pckbc0: using irq 1 for kbd slot
 s011a# grep interrupt !$
 grep interrupt /var/run/dmesg.boot
 acpi0: SCI interrupting at int 9
 wm0: interrupting at ioapic1 pin 0
 wm1: interrupting at ioapic1 pin 1
 Intel product 0x0326 (interrupt system, interface 0x20, revision 0x09) 
 at pci1 dev 0 function 1 not configured
 uhci0: interrupting at ioapic0 pin 16
 uhci1: interrupting at ioapic0 pin 17
 uhci2: interrupting at ioapic0 pin 18
 ehci0: interrupting at ioapic0 pin 18
 wm2: interrupting at ioapic0 pin 16
 wm3: interrupting at ioapic0 pin 17
 wm4: interrupting at ioapic0 pin 18
 wm5: interrupting at ioapic0 pin 19
 uhci3: interrupting at ioapic0 pin 23
 uhci4: interrupting at ioapic0 pin 22
 uhci5: interrupting at ioapic0 pin 21
 ehci1: interrupting at ioapic0 pin 23
 pciide0: using ioapic0 pin 23 for native-PCI interrupt
 piixide0: primary channel interrupting at ioapic0 pin 14
 piixide0: secondary channel interrupting at ioapic0 pin 15
 ichsmb0: interrupting at ioapic0 pin 17
 piixide1: using ioapic0 pin 18 for native-PCI interrupt
 timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
 s011a# grep ioapic !$
 grep ioapic /var/run/dmesg.boot
 ioapic0 at mainbus0 apid 2: pa 0xfec00000, version 20, 24 pins
 ioapic1 at mainbus0 apid 3: pa 0xfecc0000, version 20, 24 pins
 wm0: interrupting at ioapic1 pin 0
 wm1: interrupting at ioapic1 pin 1
 uhci0: interrupting at ioapic0 pin 16
 uhci1: interrupting at ioapic0 pin 17
 uhci2: interrupting at ioapic0 pin 18
 ehci0: interrupting at ioapic0 pin 18
 wm2: interrupting at ioapic0 pin 16
 wm3: interrupting at ioapic0 pin 17
 wm4: interrupting at ioapic0 pin 18
 wm5: interrupting at ioapic0 pin 19
 uhci3: interrupting at ioapic0 pin 23
 uhci4: interrupting at ioapic0 pin 22
 uhci5: interrupting at ioapic0 pin 21
 ehci1: interrupting at ioapic0 pin 23
 pciide0: using ioapic0 pin 23 for native-PCI interrupt
 piixide0: primary channel interrupting at ioapic0 pin 14
 piixide0: secondary channel interrupting at ioapic0 pin 15
 ichsmb0: interrupting at ioapic0 pin 17
 piixide1: using ioapic0 pin 18 for native-PCI interrupt
 
 
 
 
 Hmmm - no irq6 or pin 6 !?!?!?!
 The first greating line is in /var/run/dmesg.boot, so the result is 
 "complete".
 The board is reported as 'Supermicro X7SBi-LN4 (0123456789)'.
 The two "additonal" interfaces (wm0 and wm1) are on ioapic1 - no other 
 HW is on that APIC.
 
 
 
 The search for not-configured devices say:
 s011a# grep confi !$
 grep confi /var/run/dmesg.boot
 MBRD (PNP0C02) at acpi0 not configured
 FWH (INT0800) at acpi0 not configured
 pci0 at mainbus0 bus 0: configuration mode 1
 Intel product 0x0326 (interrupt system, interface 0x20, revision 0x09) 
 at pci1 dev 0 function 1 not configured
 drm at vga0 not configured
 ichlpcib0: TCO (watchdog) timer configured.
 piixide0: primary channel configured to compatibility mode
 piixide0: secondary channel configured to compatibility mode
 Intel 82801I Thermal Controller (miscellaneous DASP, revision 0x02) at 
 pci0 dev 31 function 6 not configured
 
 
 Not really a lot - other systems have more unconfigured devices ....
 Normaly I'm testing all available drivers on a "new" HW, so I'm nearly 
 100% shure that theese devices are not covered by any driver.
 
 After that research it looks like there is something "suddenly" 
 interrupting without a driver for it.
 The time it take till "freeze" varies, so a "normal timer timeout" does 
 not sound reasonable.
 And why does this occure only after installing two additional interfaces 
 (wm0 and wm1) that use the second APIC? The system does not show the 
 problem without them.
 
 pci0 at mainbus0 bus 0: configuration mode 1
 pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
 pchb0 at pci0 dev 0 function 0: Intel product 0x29f0 (rev. 0x01)
 ppb0 at pci0 dev 1 function 0: Intel product 0x29f1 (rev. 0x01)
 ppb0: PCI Express 2.0 <Root Port of PCI-E Root Complex>
 pci1 at ppb0 bus 1
 pci1: i/o space, memory space enabled, rd/line, wr/inv ok
 ppb1 at pci1 dev 0 function 0: Intel product 0x032c (rev. 0x09)
 ppb1: PCI Express 1.0 <PCI-E to PCI/PCI-X Bridge>
 pci2 at ppb1 bus 2
 pci2: i/o space, memory space enabled, rd/line, wr/inv ok
 wm0 at pci2 dev 1 function 0: Intel i82546GB 1000BASE-T Ethernet (rev. 0x03)
 wm0: interrupting at ioapic1 pin 0
 wm0: 64-bit 133MHz PCIX bus
 wm0: NVM dump:
 1b00 0621 2a12 0430 ffff ffff ffff ffff
 c414 2104 460b 1179 8086 1079 8086 34e8
 000c 1079 0000 2102 10c8 ffff ffff ffff
 ffff ffff ffff ffff ffff ffff ffff ffff
 c30c ff63 5004 2102 00c8 ffff ffff ffff
 ffff ffff ffff ffff ffff ffff ffff 0602
 ffff ffff ffff ffff ffff ffff ffff ffff
 ffff ffff ffff ffff ffff ffff ffff 6be8
 wm0: 256 word (8 address bits) MicroWire EEPROM
 wm0: Ethernet address 00:1b:21:06:12:2a
 makphy0 at wm0 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
 makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
 1000baseT-FDX, auto
 wm1 at pci2 dev 1 function 1: Intel i82546GB 1000BASE-T Ethernet (rev. 0x03)
 wm1: interrupting at ioapic1 pin 1
 wm1: 64-bit 133MHz PCIX bus
 wm1: NVM dump:
 1b00 0621 2a12 0430 ffff ffff ffff ffff
 c414 2104 460b 1179 8086 1079 8086 34e8
 000c 1079 0000 2102 10c8 ffff ffff ffff
 ffff ffff ffff ffff ffff ffff ffff ffff
 c30c ff63 5004 2102 00c8 ffff ffff ffff
 ffff ffff ffff ffff ffff ffff ffff 0602
 ffff ffff ffff ffff ffff ffff ffff ffff
 ffff ffff ffff ffff ffff ffff ffff 6be8
 wm1: 256 word (8 address bits) MicroWire EEPROM
 wm1: Ethernet address 00:1b:21:06:12:2b
 makphy1 at wm1 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
 makphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 
 1000baseT-FDX, auto
 
 
 Question: Are unused interrupts enabled? And if "yes", why?
 I thought I've seen some code that will print a warning about this and 
 tries to reset the interrupt, but I'm not 100% shure about the 
 netbsd-version I've looked at.
 
 Any hints to go on with debugging?
 It would take me a lot of time to "understand" the interrupt setup on 
 the amd64 architecture ...
 
 best regards
 
 W. Stukenbrock
 
 
 Manuel Bouyer wrote:
 
 > The following reply was made to PR kern/48733; it has been noted by GNATS.
 > 
 > From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
 > To: gnats-bugs%NetBSD.org@localhost
 > Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, 
 > netbsd-bugs%NetBSD.org@localhost,
 >         Wolfgang.Stukenbrock%nagler-company.com@localhost
 > Subject: Re: kern/48733: deadlock in if_output() with interrupt on 
 > KERNEL_LOCK
 > Date: Mon, 12 May 2014 14:23:46 +0200
 > 
 >  On Fri, May 09, 2014 at 12:20:00PM +0000, Wolfgang Stukenbrock wrote:
 >  >  [...]
 >  >  db{0}> bt/a fffffe822f73f420
 >  >  trace: pid 0 lid 3 at 0xfffffe810e967760
 >  >  ether_output() at netbsd:ether_output+0x2b6
 >  >  ip_output() at netbsd:ip_output+0xa8f
 >  >  tcp_output() at netbsd:tcp_output+0x1698
 >  >  tcp_input() at netbsd:tcp_input+0x15d9
 >  >  ip_input() at netbsd:ip_input+0x3ef
 >  >  ipintr() at netbsd:ipintr+0x109
 >  >  softint_dispatch() at netbsd:softint_dispatch+0xd9
 >  >  DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e967d70
 >  >  Xsoftintr() at netbsd:Xsoftintr+0x4f
 >  >  --- interrupt ---
 >  >  0:
 >  >  
 >  >  That is the part that is gooing to send a packet. I see the printout in 
 >  >  ip_output prior calling 'ifp->if_output()' - not the one behind.
 >  >  The location pointed to by the backtrace in ether_output() is the call 
 >  >  to "return ifq_enqueue(...)". I also see the printout I've added in 
 >  >  front of this call, but not the one behind.
 >  >  In ifq_enqueue() I see the output of the call to 'ifp->if_start' - the 
 >  >  wm-driver - in this routine and the printout in front of the splx(s) at 
 >  >  the end of the routine - not the printout behind it.
 >  >  This is the localtion where the deadlock happens while processing other 
 >  >  interrupts in Xspllower.
 >  >  This always looks the same ....
 >  
 >  ether_output() is called with the KERNEL_LOCK held, so at this point cpu0
 >  already owns KERNEL_LOCK, it won't spin trying to grab it again.
 >  You can confirm this by printing curcpu()->ci_biglock_count.
 >  Did you try a kernel with options LOCKDEBUG ?
 >  
 >  What's possible here is a loop trying to process the same interrupt
 >  forever.
 >  
 >  >  
 >  >  
 >  >  
 >  >  db{0}> bt/a fffffe822f736440
 >  >  trace: pid 0 lid 6 at 0xfffffe810e9739c8
 >  >  breakpoint() at netbsd:breakpoint+0x5
 >  >  comintr() at netbsd:comintr+0x518
 >  >  Xintr_ioapic_edge1() at netbsd:Xintr_ioapic_edge1+0xea
 >  >  --- interrupt ---
 >  >  bus_space_read_4() at netbsd:bus_space_read_4+0xa
 >  >  intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x3b
 >  >  Xintr_ioapic_level6() at netbsd:Xintr_ioapic_level6+0xf2
 >  >  --- interrupt ---
 >  >  Xspllower() at netbsd:Xspllower+0xe
 >  >  DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e973d70
 >  >  Xsoftintr() at netbsd:Xsoftintr+0x4f
 >  >  --- interrupt ---
 >  >  0:
 >  >  
 >  >  
 >  >  Hmmm - not shure about it ...
 >  >  It looks like that during processing one pending interrupt in Xspllower 
 >  >  at the end of that routine an interrupt came im that takes the 
 >  >  KERNEL_LOCK in intr_biglock_wrapper() again and do what? Hangup in 
 >  >  bus_space_read_4() ???? Busy-loop for whatever reason in that interrupt 
 >  >  and the location where the DDB-enter occures in bus_space_read_4() is 
 >  >  just random ????
 >  >  The comintr looks like the break-interrupt on the serial console of the 
 >  >  system to enter DDB to me.
 >  
 >  it is.
 >  
 >  >  Any idea to find out what interrupt routine it is???
 >  
 >  dmesg could point to the problem; the interrupt we're looking for is
 >  level-triggered on pin 6 (so maybe "irq 6")
 >  
 >  -- 
 >  Manuel Bouyer <bouyer%antioche.eu.org@localhost>
 >       NetBSD: 26 ans d'experience feront toujours la difference
 >  --
 >  
 > 
 
 
 -- 
 
 
 Dr. Nagler & Company GmbH
 Hauptstraße 9
 92253 Schnaittenbach
 
 Tel. +49 9622/71 97-42
 Fax +49 9622/71 97-50
 
 Wolfgang.Stukenbrock%nagler-company.com@localhost
 http://www.nagler-company.com
 
 
 Hauptsitz: Schnaittenbach
 Handelregister: Amberg HRB
 Gerichtsstand: Amberg
 Steuernummer: 201/118/51825
 USt.-ID-Nummer: DE 273143997
 Geschäftsführer: Dr. Martin Nagler
 
 


Home | Main Index | Thread Index | Old Index