NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK

To: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
Date: Fri, 09 May 2014 14:18:18 +0200

Hi again,

I've tried to analyse the stackframes etc. a little bit more - but now Ineed some help ...


After entering DDB I see the following:

db{0}> ps /l
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
666  >   1 7   0         0   fffffe822c504b20               ftpd
638      1 3   1        80   fffffe822e047280                csh ttyraw
646      1 3   1        80   fffffe822dae06e0             bounce kqueue
505      1 3   1        80   fffffe822dae02c0               smtp lockf
615      1 3   0        80   fffffe822c504700               smtp kqueue
632      1 3   0        80   fffffe822c605b60    trivial-rewrite kqueue
639      1 3   1        80   fffffe81330ce240            cleanup kqueue
690      1 3   1        80   fffffe81330cea80               qmgr kqueue
644      1 3   0        80   fffffe81330ce660             pickup kqueue
691      1 3   1        80   fffffe810f80f220              login wait
507      1 3   0        80   fffffe822c8c6300              inetd kqueue
628      1 3   1        80   fffffe822c8c6720             master kqueue
437      1 3   0        80   fffffe822e047ac0            choparp select
381      1 3   0        80   fffffe822c8c6b40             powerd kqueue
337      1 3   0        80   fffffe822df43260             logger pipe_rd
333      1 3   0        80   fffffe822df0b2a0                 sh wait
183  >   1 7   1         0   fffffe822df0bae0            syslogd
1        1 3   1        80   fffffe810f1d21a0               init wait
0       66 3   1       200   fffffe822e0476a0            physiod physiod
0       65 3   0       200   fffffe810f80f640           aiodoned aiodoned
0       64 3   1       200   fffffe810f506200            ioflush syncer
0       63 3   0       200   fffffe810f80fa60           pgdaemon pgdaemon
0       60 3   0       200   fffffe810f506620            raidio1 raidiow
0       59 3   0       200   fffffe810f506a40              raid1 rfnodeq
0       58 3   1       200   fffffe810f21c1e0            raidio0 raidiow
0       57 3   1       200   fffffe810f21c600              raid0 rfnodeq
0       56 3   1       200   fffffe810f10d160            swcrthr swcr_wait
0       55 3   1       200   fffffe810f1dda00          cryptoret crypto_w
0       53 3   1       200   fffffe810f10d9a0               usb6 usbevt
0       52 3   1       200   fffffe810f14e9c0               usb4 usbevt
0       51 3   1       200   fffffe810f0cc560               usb7 usbevt
0       50 3   1       200   fffffe810f10d580               usb2 usbevt
0       49 3   1       200   fffffe810f0cc140               usb5 usbevt
0       48 3   1       200   fffffe810f08a120               usb3 usbevt
0       47 3   1       200   fffffe810f0cc980               usb0 usbevt
0       46 3   1       200   fffffe810f21ca20         usbtask-dr usbtsk
0       45 3   1       200   fffffe810f1dd1c0         usbtask-hc usbtsk
0       44 3   1       200   fffffe810f1dd5e0               usb1 usbevt
0       42 3   1       200   fffffe810f1d25c0              unpgc unpgc
0       41 3   1       200   fffffe810f1d29e0        vmem_rehash vmem_rehash
0       40 3   0       200   fffffe810f14e180          coretemp1 coretemp1
0       39 3   1       200   fffffe810f14e5a0          coretemp0 coretemp0
0       30 3   1       200   fffffe810f08a540            atabus5 atath
0       29 3   0       200   fffffe810f08a960            atabus4 atath
0       28 3   0       200   fffffe810f01e100               iic0 iicintr
0       27 3   1       200   fffffe810f01e520            atabus3 atath
0       26 3   1       200   fffffe810f01e940             atabus2 atath
0       25 3   0       200   fffffe810ea580e0            atabus1 atath
0       24 3   0       200   fffffe810ea58500            atabus0 atath
0       23 3   0       200   fffffe810ea58920               apm0 apmev
0       22 3   1       200   fffffe810e9f30c0            xcall/1 xcall
0       21 1   1       200   fffffe810e9f34e0          softser/1
0       20 3   1       200   fffffe810e9f3900          softclk/1 tstile
0       19 1   1       200   fffffe810e9b50a0          softbio/1
0       18 1   1       200   fffffe810e9b54c0          softnet/1
0       17 1   1       201   fffffe810e9b58e0             idle/1
0       16 3   0       200   fffffe822e11b080             sysmon smtaskq
0       15 3   0       200   fffffe822e11b4a0         pmfsuspend pmfsuspend
0       14 3   0       200   fffffe822e11b8c0           pmfevent pmfevent
0       13 3   0       200   fffffe822ef24060         sopendfree sopendfr
0       12 3   0       200   fffffe822ef24480           nfssilly nfssilly
0       11 3   1       200   fffffe822ef248a0            cachegc cachegc
0       10 3   1       200   fffffe822f72a040              vrele vrele
0        9 3   1       200   fffffe822f72a460             vdrain vdrain
0        8 3   0       200   fffffe822f72a880          modunload mod_unld
0        7 3   0       200   fffffe822f736020            xcall/0 xcall
0    >   6 7   0       200   fffffe822f736440          softser/0
0        5 1   0       200   fffffe822f736860          softclk/0
0        4 1   0       200   fffffe822f73f000          softbio/0
0    >   3 7   0       200   fffffe822f73f420          softnet/0
0        2 1   0       201   fffffe822f73f840             idle/0
0        1 3   1       200   ffffffff80655480            swapper uvm

OK - I thing the kernel lwp 6 and 3 are most relevant, but first theother two processes marked with '>':


db{0}> bt/a fffffe822c504b20
trace: pid 666 lid 1 at 0xfffffe810f8a9b40
soo_write() at netbsd:soo_write+0x2c
dofilewrite() at netbsd:dofilewrite+0x7d
sys_write() at netbsd:sys_write+0x62
syscall() at netbsd:syscall+0xac

No surprise - I'm running an ftp to the system to trigger the problem.
ftpd is gooing to send some data.

db{0}> bt/a fffffe822df0bae0
trace: pid 183 lid 1 at 0xfffffe810f82abe0
kevent1() at netbsd:kevent1+0x5c0
sys___kevent50() at netbsd:sys___kevent50+0x33
syscall() at netbsd:syscall+0xac

Whatever syslogd is gooing to do - I do not think it is relevant.

I think it is woken by a message added by either my print statements orthe unknown "interrupt" routine (see below).


db{0}> bt/a fffffe822f73f420
trace: pid 0 lid 3 at 0xfffffe810e967760
ether_output() at netbsd:ether_output+0x2b6
ip_output() at netbsd:ip_output+0xa8f
tcp_output() at netbsd:tcp_output+0x1698
tcp_input() at netbsd:tcp_input+0x15d9
ip_input() at netbsd:ip_input+0x3ef
ipintr() at netbsd:ipintr+0x109
softint_dispatch() at netbsd:softint_dispatch+0xd9
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e967d70
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:

That is the part that is gooing to send a packet. I see the printout inip_output prior calling 'ifp->if_output()' - not the one behind.The location pointed to by the backtrace in ether_output() is the callto "return ifq_enqueue(...)". I also see the printout I've added infront of this call, but not the one behind.In ifq_enqueue() I see the output of the call to 'ifp->if_start' - thewm-driver - in this routine and the printout in front of the splx(s) atthe end of the routine - not the printout behind it.This is the localtion where the deadlock happens while processing otherinterrupts in Xspllower.

This always looks the same ....



db{0}> bt/a fffffe822f736440
trace: pid 0 lid 6 at 0xfffffe810e9739c8
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x518
Xintr_ioapic_edge1() at netbsd:Xintr_ioapic_edge1+0xea
--- interrupt ---
bus_space_read_4() at netbsd:bus_space_read_4+0xa
intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x3b
Xintr_ioapic_level6() at netbsd:Xintr_ioapic_level6+0xf2
--- interrupt ---
Xspllower() at netbsd:Xspllower+0xe
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e973d70
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:


Hmmm - not shure about it ...

It looks like that during processing one pending interrupt in Xspllowerat the end of that routine an interrupt came im that takes theKERNEL_LOCK in intr_biglock_wrapper() again and do what? Hangup inbus_space_read_4() ???? Busy-loop for whatever reason in that interruptand the location where the DDB-enter occures in bus_space_read_4() isjust random ????The comintr looks like the break-interrupt on the serial console of thesystem to enter DDB to me.

Any idea to find out what interrupt routine it is???

Where should I go on with debugging?

Accedently the trace-back is not 100% the same in the interrupt-partevery time. (But it happens always during the splx(s) in ifq_enqueue()and the other two processes also have the same backtrace.)

The system has frosen at this point. If I try to reboot it from DDB itgets lots of packet interrupts from the interfaces and freezes againafter all are processed. Break and reboot again will reboot the system.

But I think this information is not relevant.

Thanks in advance for help.


By the way:

The '/a' for the trace command is not documented in DDB manuals and theonline help of DDB !!!!!I've found it by a look into the sources while trying to get thebacktraces for the kernel-threads.

Perhaps that should be changed ...
Should I open a PR for this?

Best regards

W. Stukenbrock


Wolfgang Stukenbrock wrote:

The following reply was made to PR kern/48733; it has been noted by GNATS.

From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, 
netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Date: Sun, 27 Apr 2014 10:21:03 +0200

 Hi, sorry for the delay, but I'm working off-side most time at the moment.

I have to look onto the system to validate the assumption, that the USBshares the interupt. (I remember that there are some interrupts shared.)

 But I can do this not earlyer as 02.05.2014 or later ...

And lowering the spl-level does not imply that the interrupt has to beshared with the wm-driver to run into this problem.The USB-interupt is not the only one I've seen.And I've analysed the stack backtrace that has dead-locked the systemand it locks up in the interrupt code while trying to get the KERNEL_LOCK!I will have a look at this again, but as far as I remember it definitlylocks up in the splx while trying to get the KERNEL_LOCK. splx does notreturn anymore and I've got the interrupt source by translating the jumpaddress to a symbol that has not been reached anymore.(I've still not commented out the LOCK/UNLOCK prior calling theoutput-routine to validate my analyses - no time till now, sorry - but Iwill do it as soon as possible.)I will send the results of my review as soon as I've be able to do it.best regardsW. StukenbrockManuel Bouyer wrote:> The following reply was made to PR kern/48733; it has been noted by GNATS.>> From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>

 > To: gnats-bugs%NetBSD.org@localhost
 > Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, 
netbsd-bugs%NetBSD.org@localhost
 > Subject: Re: kern/48733: deadlock in if_output() with interrupt on 
KERNEL_LOCK
 > Date: Fri, 11 Apr 2014 17:59:09 +0200

>> On Fri, Apr 11, 2014 at 02:40:00PM +0000, Wolfgang.Stukenbrock%nagler-company.com@localhost wrote:

 >  > >Description:
 >  >     Problem located in /src/sys/netinet/ip_output.c.
 >  >     Since file revision 1.208 the Kernel-Lock is locked prior calling 
if_output
 >  >     on the interface.
 >  >     Now - at least the wm-driver - will call splnet() and splx() inside 
the output
 >  >     routine.
 >  >     If any interrupt occurs in between splnet() and splx(), the interrupt 
is delayed and
 >  >     is processes in splx() when the level is released again.
 >  >     If such an interrupt is e.g. not MP-SAFE, the call stup in 
intr_biglock_wrapper() is
 >  >     used to call the interrupt routine and that one will lock the 
KERNEL-LOCK again.
 >  >     So we try to lock it again here -> dead-lock.

> >> > Our system runs fine with 4 8257x interfaces, but after adding 2 additional 8254x

 >  >     interfaces, the system lock-up after a short time. Don't ask me, why 
the if_output
 >  >     call takes "to long" with theese two additonal interfaces, but it is 
reproducable.
 >  >     I've analysed this several times with DDB. Most times I've seen an 
USB-interrupt
 >  >     that dead-lock the system.

>> I think your analsys is wrong. the KERNEL_LOCK is special in the sense that

 >  it can be locked multiple time on the same CPU. So it's not a problem
 >  that splx() on the same CPU tries to get KERNEL_LOCK again, it will just
 >  increase the lock count. A splx() on another CPU will wait for the
 >  KERNEL_LOCK to be relased.

>> I think your problem is more likely in the USB stack.

 >  Maybe one of your new ethernet interface shares an interrupt with the
 >  USB controller ?

>>> > >How-To-Repeat:

 >  >     Run a lot of trafic over wm-interfaces and do shomething e.g. on USB 
at the same
 >  >     time. It is just a question of time till system-dead-lock.
 >  > >Fix:
 >  >     Fist guess: revert change done from 1.207 to 1.208.
 >  >     But I've no idea about side effects.

>> Very bad: the output queues are protected by the KERNEL_LOCK and splnet().

 >  If you revert ip_output 1.208, you'll also have to revert ip_input.c
 >  1.286 and 1.285, so that the whole IP stack runs under the KERNEL_LOCK 
again.

>> --> Manuel Bouyer <bouyer%antioche.eu.org@localhost>

 >       NetBSD: 26 ans d'experience feront toujours la difference
 >  --

>>>> Received: from DBXPR07MB317.eurprd07.prod.outlook.com (10.141.12.139) by

 >  DBXPR07MB319.eurprd07.prod.outlook.com (10.141.12.141) with Microsoft SMTP
 >  Server (TLS) id 15.0.918.8 via Mailbox Transport; Fri, 11 Apr 2014 16:00:11
 >  +0000
 > Received: from DBXPR07CA001.eurprd07.prod.outlook.com (10.255.191.159) by
 >  DBXPR07MB317.eurprd07.prod.outlook.com (10.141.12.139) with Microsoft SMTP
 >  Server (TLS) id 15.0.918.8; Fri, 11 Apr 2014 16:00:10 +0000
 > Received: from DB3FFO11FD011.protection.gbl (2a01:111:f400:7e04::177) by
 >  DBXPR07CA001.outlook.office365.com (2a01:111:e400:9800::31) with Microsoft
 >  SMTP Server (TLS) id 15.0.918.8 via Frontend Transport; Fri, 11 Apr 2014
 >  16:00:09 +0000
 > Received: from e002.nagler-company.com (212.185.86.227) by
 >  DB3FFO11FD011.mail.protection.outlook.com (10.47.216.167) with Microsoft 
SMTP
 >  Server (TLS) id 15.0.918.6 via Frontend Transport; Fri, 11 Apr 2014 16:00:08
 >  +0000
 > Received: from mollari.NetBSD.org (mollari.netbsd.org [149.20.53.80])
 >   by e002.nagler-company.com (8.14.7/8.14.7) with ESMTP id s3BG0378005931
 >   for <Wolfgang.Stukenbrock%nagler-company.com@localhost>; Fri, 11 Apr 2014 
18:00:06 +0200 (CEST)
 > Received: by mollari.NetBSD.org (Postfix, from userid 31008)
 >   id C23A5A5828; Fri, 11 Apr 2014 16:00:01 +0000 (UTC)
 > From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
 > To: <kern-bug-people%netbsd.org@localhost>, 
<gnats-admin%netbsd.org@localhost>,
 >   <netbsd-bugs%netbsd.org@localhost>, 
<Wolfgang.Stukenbrock%nagler-company.com@localhost>
 > Reply-To: <gnats-bugs%NetBSD.org@localhost>
 > Subject: Re: kern/48733: deadlock in if_output() with interrupt on 
KERNEL_LOCK
 > References: <pr-kern-48733%gnats.netbsd.org@localhost>
 >   <20140411131311.74AF4123B93%test-s0.nagler-company.com@localhost>
 > X-Gnats-Was-Stupid: no
 > CC:
 > Message-ID: <20140411160001.C23A5A5828%mollari.NetBSD.org@localhost>
 > Date: Fri, 11 Apr 2014 16:00:01 +0000
 > Return-Path: gnats%NetBSD.org@localhost
 > X-EOPAttributedMessage: 0
 > X-MS-Exchange-Organization-MessageDirectionality: Incoming
 > X-Forefront-Antispam-Report: 
CIP:212.185.86.227;CTRY:DE;IPV:NLI;EFV:NLI;SFV:NSPM;SFS:(979002)(6009001)(428001)(50944004)(51704005)(24454002)(199002)(189002)(16796002)(46386002)(74502001)(70486001)(50466002)(87836001)(46102001)(90896003)(77982001)(52956003)(54356999)(50986999)(76176999)(53806999)(81542001)(81342001)(74662001)(45336002)(43066001)(83072002)(85852003)(6806004)(19580395003)(83322001)(42186004)(44976005)(80976001)(19580405001)(76482001)(99396002)(4396001)(80022001)(33656001)(20776003)(47776003)(48376002)(2201001)(79102001)(92726001)(42882001)(90966001)(969003)(989001)(999001)(1009001)(1019001);DIR:INB;SFP:;SCL:1;SRVR:DBXPR07MB317;H:e002.nagler-company.com;FPR:BF74F31D.9C06D725.B1F32CB3.4CE95053.203B9;PTR:e002.nagler-company.com;A:1;MX:1;LANG:en;
 > Content-Type: text/plain
 > X-MS-Exchange-Organization-Network-Message-Id: 
ae18e11e-2fa8-44eb-8125-08d123b63ff9
 > X-MS-Exchange-Organization-AVStamp-Service: 1.0
 > Received-SPF: None (: NetBSD.org does not designate permitted sender hosts)
 > X-MS-Exchange-Organization-SCL: 1
 > X-MS-Exchange-Organization-AuthSource: DB3FFO11FD011.protection.gbl
 > X-MS-Exchange-Organization-AuthAs: Anonymous
 > MIME-Version: 1.0

>> The following reply was made to PR kern/48733; it has been noted by GNATS.>> From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>

 > To: gnats-bugs%NetBSD.org@localhost
 > Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, 
netbsd-bugs%NetBSD.org@localhost
 > Subject: Re: kern/48733: deadlock in if_output() with interrupt on 
KERNEL_LOCK
 > Date: Fri, 11 Apr 2014 17:59:09 +0200

>> On Fri, Apr 11, 2014 at 02:40:00PM +0000, Wolfgang.Stukenbrock%nagler-company.com@localhost wrote:

 >  > >Description:
 >  >     Problem located in /src/sys/netinet/ip_output.c.
 >  >     Since file revision 1.208 the Kernel-Lock is locked prior calling 
if_output
 >  >     on the interface.
 >  >     Now - at least the wm-driver - will call splnet() and splx() inside 
the output
 >  >     routine.
 >  >     If any interrupt occurs in between splnet() and splx(), the interrupt 
is delayed and
 >  >     is processes in splx() when the level is released again.
 >  >     If such an interrupt is e.g. not MP-SAFE, the call stup in 
intr_biglock_wrapper() is
 >  >     used to call the interrupt routine and that one will lock the 
KERNEL-LOCK again.
 >  >     So we try to lock it again here -> dead-lock.

> >> > Our system runs fine with 4 8257x interfaces, but after adding 2 additional 8254x

 >  >     interfaces, the system lock-up after a short time. Don't ask me, why 
the if_output
 >  >     call takes "to long" with theese two additonal interfaces, but it is 
reproducable.
 >  >     I've analysed this several times with DDB. Most times I've seen an 
USB-interrupt
 >  >     that dead-lock the system.

>> I think your analsys is wrong. the KERNEL_LOCK is special in the sense that

 >  it can be locked multiple time on the same CPU. So it's not a problem
 >  that splx() on the same CPU tries to get KERNEL_LOCK again, it will just
 >  increase the lock count. A splx() on another CPU will wait for the
 >  KERNEL_LOCK to be relased.

>> I think your problem is more likely in the USB stack.

 >  Maybe one of your new ethernet interface shares an interrupt with the
 >  USB controller ?

>>> > >How-To-Repeat:

 >  >     Run a lot of trafic over wm-interfaces and do shomething e.g. on USB 
at the same
 >  >     time. It is just a question of time till system-dead-lock.
 >  > >Fix:
 >  >     Fist guess: revert change done from 1.207 to 1.208.
 >  >     But I've no idea about side effects.

>> Very bad: the output queues are protected by the KERNEL_LOCK and splnet().

 >  If you revert ip_output 1.208, you'll also have to revert ip_input.c
 >  1.286 and 1.285, so that the whole IP stack runs under the KERNEL_LOCK 
again.

>> --> Manuel Bouyer <bouyer%antioche.eu.org@localhost>

 >       NetBSD: 26 ans d'experience feront toujours la difference
 >  --

>>--Dr. Nagler & Company GmbH

 Hauptstraße 9
 92253 Schnaittenbach

Tel. +49 9622/71 97-42

 Fax +49 9622/71 97-50

Wolfgang.Stukenbrock%nagler-company.com@localhost

 http://www.nagler-company.com

Hauptsitz: Schnaittenbach

 Handelregister: Amberg HRB
 Gerichtsstand: Amberg
 Steuernummer: 201/118/51825
 USt.-ID-Nummer: DE 273143997
 Geschäftsführer: Dr. Martin Nagler



--


Dr. Nagler & Company GmbH
Hauptstraße 9
92253 Schnaittenbach

Tel. +49 9622/71 97-42
Fax +49 9622/71 97-50

Wolfgang.Stukenbrock%nagler-company.com@localhost
http://www.nagler-company.com


Hauptsitz: Schnaittenbach
Handelregister: Amberg HRB
Gerichtsstand: Amberg
Steuernummer: 201/118/51825
USt.-ID-Nummer: DE 273143997
Geschäftsführer: Dr. Martin Nagler

References:
- Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
  - From: Wolfgang Stukenbrock

Prev by Date: port-sparc/48795: msiiep_timer.c calibration broken?
Next by Date: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Previous by Thread: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Next by Thread: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Indexes:

Home | Main Index | Thread Index | Old Index