NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
The following reply was made to PR kern/48733; it has been noted by GNATS.
From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Date: Fri, 09 May 2014 14:18:18 +0200
Hi again,
I've tried to analyse the stackframes etc. a little bit more - but now I
need some help ...
After entering DDB I see the following:
db{0}> ps /l
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
666 > 1 7 0 0 fffffe822c504b20 ftpd
638 1 3 1 80 fffffe822e047280 csh ttyraw
646 1 3 1 80 fffffe822dae06e0 bounce kqueue
505 1 3 1 80 fffffe822dae02c0 smtp lockf
615 1 3 0 80 fffffe822c504700 smtp kqueue
632 1 3 0 80 fffffe822c605b60 trivial-rewrite kqueue
639 1 3 1 80 fffffe81330ce240 cleanup kqueue
690 1 3 1 80 fffffe81330cea80 qmgr kqueue
644 1 3 0 80 fffffe81330ce660 pickup kqueue
691 1 3 1 80 fffffe810f80f220 login wait
507 1 3 0 80 fffffe822c8c6300 inetd kqueue
628 1 3 1 80 fffffe822c8c6720 master kqueue
437 1 3 0 80 fffffe822e047ac0 choparp select
381 1 3 0 80 fffffe822c8c6b40 powerd kqueue
337 1 3 0 80 fffffe822df43260 logger pipe_rd
333 1 3 0 80 fffffe822df0b2a0 sh wait
183 > 1 7 1 0 fffffe822df0bae0 syslogd
1 1 3 1 80 fffffe810f1d21a0 init wait
0 66 3 1 200 fffffe822e0476a0 physiod physiod
0 65 3 0 200 fffffe810f80f640 aiodoned aiodoned
0 64 3 1 200 fffffe810f506200 ioflush syncer
0 63 3 0 200 fffffe810f80fa60 pgdaemon pgdaemon
0 60 3 0 200 fffffe810f506620 raidio1 raidiow
0 59 3 0 200 fffffe810f506a40 raid1 rfnodeq
0 58 3 1 200 fffffe810f21c1e0 raidio0 raidiow
0 57 3 1 200 fffffe810f21c600 raid0 rfnodeq
0 56 3 1 200 fffffe810f10d160 swcrthr swcr_wait
0 55 3 1 200 fffffe810f1dda00 cryptoret crypto_w
0 53 3 1 200 fffffe810f10d9a0 usb6 usbevt
0 52 3 1 200 fffffe810f14e9c0 usb4 usbevt
0 51 3 1 200 fffffe810f0cc560 usb7 usbevt
0 50 3 1 200 fffffe810f10d580 usb2 usbevt
0 49 3 1 200 fffffe810f0cc140 usb5 usbevt
0 48 3 1 200 fffffe810f08a120 usb3 usbevt
0 47 3 1 200 fffffe810f0cc980 usb0 usbevt
0 46 3 1 200 fffffe810f21ca20 usbtask-dr usbtsk
0 45 3 1 200 fffffe810f1dd1c0 usbtask-hc usbtsk
0 44 3 1 200 fffffe810f1dd5e0 usb1 usbevt
0 42 3 1 200 fffffe810f1d25c0 unpgc unpgc
0 41 3 1 200 fffffe810f1d29e0 vmem_rehash vmem_rehash
0 40 3 0 200 fffffe810f14e180 coretemp1 coretemp1
0 39 3 1 200 fffffe810f14e5a0 coretemp0 coretemp0
0 30 3 1 200 fffffe810f08a540 atabus5 atath
0 29 3 0 200 fffffe810f08a960 atabus4 atath
0 28 3 0 200 fffffe810f01e100 iic0 iicintr
0 27 3 1 200 fffffe810f01e520 atabus3 atath
0 26 3 1 200 fffffe810f01e940 atabus2 atath
0 25 3 0 200 fffffe810ea580e0 atabus1 atath
0 24 3 0 200 fffffe810ea58500 atabus0 atath
0 23 3 0 200 fffffe810ea58920 apm0 apmev
0 22 3 1 200 fffffe810e9f30c0 xcall/1 xcall
0 21 1 1 200 fffffe810e9f34e0 softser/1
0 20 3 1 200 fffffe810e9f3900 softclk/1 tstile
0 19 1 1 200 fffffe810e9b50a0 softbio/1
0 18 1 1 200 fffffe810e9b54c0 softnet/1
0 17 1 1 201 fffffe810e9b58e0 idle/1
0 16 3 0 200 fffffe822e11b080 sysmon smtaskq
0 15 3 0 200 fffffe822e11b4a0 pmfsuspend pmfsuspend
0 14 3 0 200 fffffe822e11b8c0 pmfevent pmfevent
0 13 3 0 200 fffffe822ef24060 sopendfree sopendfr
0 12 3 0 200 fffffe822ef24480 nfssilly nfssilly
0 11 3 1 200 fffffe822ef248a0 cachegc cachegc
0 10 3 1 200 fffffe822f72a040 vrele vrele
0 9 3 1 200 fffffe822f72a460 vdrain vdrain
0 8 3 0 200 fffffe822f72a880 modunload mod_unld
0 7 3 0 200 fffffe822f736020 xcall/0 xcall
0 > 6 7 0 200 fffffe822f736440 softser/0
0 5 1 0 200 fffffe822f736860 softclk/0
0 4 1 0 200 fffffe822f73f000 softbio/0
0 > 3 7 0 200 fffffe822f73f420 softnet/0
0 2 1 0 201 fffffe822f73f840 idle/0
0 1 3 1 200 ffffffff80655480 swapper uvm
OK - I thing the kernel lwp 6 and 3 are most relevant, but first the
other two processes marked with '>':
db{0}> bt/a fffffe822c504b20
trace: pid 666 lid 1 at 0xfffffe810f8a9b40
soo_write() at netbsd:soo_write+0x2c
dofilewrite() at netbsd:dofilewrite+0x7d
sys_write() at netbsd:sys_write+0x62
syscall() at netbsd:syscall+0xac
No surprise - I'm running an ftp to the system to trigger the problem.
ftpd is gooing to send some data.
db{0}> bt/a fffffe822df0bae0
trace: pid 183 lid 1 at 0xfffffe810f82abe0
kevent1() at netbsd:kevent1+0x5c0
sys___kevent50() at netbsd:sys___kevent50+0x33
syscall() at netbsd:syscall+0xac
Whatever syslogd is gooing to do - I do not think it is relevant.
I think it is woken by a message added by either my print statements or
the unknown "interrupt" routine (see below).
db{0}> bt/a fffffe822f73f420
trace: pid 0 lid 3 at 0xfffffe810e967760
ether_output() at netbsd:ether_output+0x2b6
ip_output() at netbsd:ip_output+0xa8f
tcp_output() at netbsd:tcp_output+0x1698
tcp_input() at netbsd:tcp_input+0x15d9
ip_input() at netbsd:ip_input+0x3ef
ipintr() at netbsd:ipintr+0x109
softint_dispatch() at netbsd:softint_dispatch+0xd9
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e967d70
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
That is the part that is gooing to send a packet. I see the printout in
ip_output prior calling 'ifp->if_output()' - not the one behind.
The location pointed to by the backtrace in ether_output() is the call
to "return ifq_enqueue(...)". I also see the printout I've added in
front of this call, but not the one behind.
In ifq_enqueue() I see the output of the call to 'ifp->if_start' - the
wm-driver - in this routine and the printout in front of the splx(s) at
the end of the routine - not the printout behind it.
This is the localtion where the deadlock happens while processing other
interrupts in Xspllower.
This always looks the same ....
db{0}> bt/a fffffe822f736440
trace: pid 0 lid 6 at 0xfffffe810e9739c8
breakpoint() at netbsd:breakpoint+0x5
comintr() at netbsd:comintr+0x518
Xintr_ioapic_edge1() at netbsd:Xintr_ioapic_edge1+0xea
--- interrupt ---
bus_space_read_4() at netbsd:bus_space_read_4+0xa
intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x3b
Xintr_ioapic_level6() at netbsd:Xintr_ioapic_level6+0xf2
--- interrupt ---
Xspllower() at netbsd:Xspllower+0xe
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e973d70
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
Hmmm - not shure about it ...
It looks like that during processing one pending interrupt in Xspllower
at the end of that routine an interrupt came im that takes the
KERNEL_LOCK in intr_biglock_wrapper() again and do what? Hangup in
bus_space_read_4() ???? Busy-loop for whatever reason in that interrupt
and the location where the DDB-enter occures in bus_space_read_4() is
just random ????
The comintr looks like the break-interrupt on the serial console of the
system to enter DDB to me.
Any idea to find out what interrupt routine it is???
Where should I go on with debugging?
Accedently the trace-back is not 100% the same in the interrupt-part
every time. (But it happens always during the splx(s) in ifq_enqueue()
and the other two processes also have the same backtrace.)
The system has frosen at this point. If I try to reboot it from DDB it
gets lots of packet interrupts from the interfaces and freezes again
after all are processed. Break and reboot again will reboot the system.
But I think this information is not relevant.
Thanks in advance for help.
By the way:
The '/a' for the trace command is not documented in DDB manuals and the
online help of DDB !!!!!
I've found it by a look into the sources while trying to get the
backtraces for the kernel-threads.
Perhaps that should be changed ...
Should I open a PR for this?
Best regards
W. Stukenbrock
Wolfgang Stukenbrock wrote:
> The following reply was made to PR kern/48733; it has been noted by GNATS.
>
> From: Wolfgang Stukenbrock
> <wolfgang.stukenbrock%nagler-company.com@localhost>
> To: gnats-bugs%NetBSD.org@localhost
> Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
> netbsd-bugs%netbsd.org@localhost
> Subject: Re: kern/48733: deadlock in if_output() with interrupt on
> KERNEL_LOCK
> Date: Sun, 27 Apr 2014 10:21:03 +0200
>
> Hi, sorry for the delay, but I'm working off-side most time at the moment.
>
> I have to look onto the system to validate the assumption, that the USB
> shares the interupt. (I remember that there are some interrupts shared.)
> But I can do this not earlyer as 02.05.2014 or later ...
> And lowering the spl-level does not imply that the interrupt has to be
> shared with the wm-driver to run into this problem.
>
> The USB-interupt is not the only one I've seen.
> And I've analysed the stack backtrace that has dead-locked the system
> and it locks up in the interrupt code while trying to get the KERNEL_LOCK!
> I will have a look at this again, but as far as I remember it definitly
> locks up in the splx while trying to get the KERNEL_LOCK. splx does not
> return anymore and I've got the interrupt source by translating the jump
> address to a symbol that has not been reached anymore.
> (I've still not commented out the LOCK/UNLOCK prior calling the
> output-routine to validate my analyses - no time till now, sorry - but I
> will do it as soon as possible.)
>
> I will send the results of my review as soon as I've be able to do it.
>
> best regards
>
> W. Stukenbrock
>
> Manuel Bouyer wrote:
>
> > The following reply was made to PR kern/48733; it has been noted by GNATS.
> >
> > From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> > To: gnats-bugs%NetBSD.org@localhost
> > Cc: kern-bug-people%NetBSD.org@localhost,
> gnats-admin%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost
> > Subject: Re: kern/48733: deadlock in if_output() with interrupt on
> KERNEL_LOCK
> > Date: Fri, 11 Apr 2014 17:59:09 +0200
> >
> > On Fri, Apr 11, 2014 at 02:40:00PM +0000,
> Wolfgang.Stukenbrock%nagler-company.com@localhost wrote:
> > > >Description:
> > > Problem located in /src/sys/netinet/ip_output.c.
> > > Since file revision 1.208 the Kernel-Lock is locked prior
> calling if_output
> > > on the interface.
> > > Now - at least the wm-driver - will call splnet() and splx()
> inside the output
> > > routine.
> > > If any interrupt occurs in between splnet() and splx(), the
> interrupt is delayed and
> > > is processes in splx() when the level is released again.
> > > If such an interrupt is e.g. not MP-SAFE, the call stup in
> intr_biglock_wrapper() is
> > > used to call the interrupt routine and that one will lock the
> KERNEL-LOCK again.
> > > So we try to lock it again here -> dead-lock.
> > >
> > > Our system runs fine with 4 8257x interfaces, but after adding
> 2 additional 8254x
> > > interfaces, the system lock-up after a short time. Don't ask
> me, why the if_output
> > > call takes "to long" with theese two additonal interfaces, but
> it is reproducable.
> > > I've analysed this several times with DDB. Most times I've seen
> an USB-interrupt
> > > that dead-lock the system.
> >
> > I think your analsys is wrong. the KERNEL_LOCK is special in the sense
> that
> > it can be locked multiple time on the same CPU. So it's not a problem
> > that splx() on the same CPU tries to get KERNEL_LOCK again, it will just
> > increase the lock count. A splx() on another CPU will wait for the
> > KERNEL_LOCK to be relased.
> >
> > I think your problem is more likely in the USB stack.
> > Maybe one of your new ethernet interface shares an interrupt with the
> > USB controller ?
> >
> >
> > > >How-To-Repeat:
> > > Run a lot of trafic over wm-interfaces and do shomething e.g.
> on USB at the same
> > > time. It is just a question of time till system-dead-lock.
> > > >Fix:
> > > Fist guess: revert change done from 1.207 to 1.208.
> > > But I've no idea about side effects.
> >
> > Very bad: the output queues are protected by the KERNEL_LOCK and
> splnet().
> > If you revert ip_output 1.208, you'll also have to revert ip_input.c
> > 1.286 and 1.285, so that the whole IP stack runs under the KERNEL_LOCK
> again.
> >
> > --
> > Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> > NetBSD: 26 ans d'experience feront toujours la difference
> > --
> >
> >
> >
> > Received: from DBXPR07MB317.eurprd07.prod.outlook.com (10.141.12.139) by
> > DBXPR07MB319.eurprd07.prod.outlook.com (10.141.12.141) with Microsoft
> SMTP
> > Server (TLS) id 15.0.918.8 via Mailbox Transport; Fri, 11 Apr 2014
> 16:00:11
> > +0000
> > Received: from DBXPR07CA001.eurprd07.prod.outlook.com (10.255.191.159) by
> > DBXPR07MB317.eurprd07.prod.outlook.com (10.141.12.139) with Microsoft
> SMTP
> > Server (TLS) id 15.0.918.8; Fri, 11 Apr 2014 16:00:10 +0000
> > Received: from DB3FFO11FD011.protection.gbl (2a01:111:f400:7e04::177) by
> > DBXPR07CA001.outlook.office365.com (2a01:111:e400:9800::31) with
> Microsoft
> > SMTP Server (TLS) id 15.0.918.8 via Frontend Transport; Fri, 11 Apr 2014
> > 16:00:09 +0000
> > Received: from e002.nagler-company.com (212.185.86.227) by
> > DB3FFO11FD011.mail.protection.outlook.com (10.47.216.167) with Microsoft
> SMTP
> > Server (TLS) id 15.0.918.6 via Frontend Transport; Fri, 11 Apr 2014
> 16:00:08
> > +0000
> > Received: from mollari.NetBSD.org (mollari.netbsd.org [149.20.53.80])
> > by e002.nagler-company.com (8.14.7/8.14.7) with ESMTP id s3BG0378005931
> > for <Wolfgang.Stukenbrock%nagler-company.com@localhost>; Fri, 11 Apr
> 2014 18:00:06 +0200 (CEST)
> > Received: by mollari.NetBSD.org (Postfix, from userid 31008)
> > id C23A5A5828; Fri, 11 Apr 2014 16:00:01 +0000 (UTC)
> > From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> > To: <kern-bug-people%netbsd.org@localhost>,
> <gnats-admin%netbsd.org@localhost>,
> > <netbsd-bugs%netbsd.org@localhost>,
> <Wolfgang.Stukenbrock%nagler-company.com@localhost>
> > Reply-To: <gnats-bugs%NetBSD.org@localhost>
> > Subject: Re: kern/48733: deadlock in if_output() with interrupt on
> KERNEL_LOCK
> > References: <pr-kern-48733%gnats.netbsd.org@localhost>
> > <20140411131311.74AF4123B93%test-s0.nagler-company.com@localhost>
> > X-Gnats-Was-Stupid: no
> > CC:
> > Message-ID: <20140411160001.C23A5A5828%mollari.NetBSD.org@localhost>
> > Date: Fri, 11 Apr 2014 16:00:01 +0000
> > Return-Path: gnats%NetBSD.org@localhost
> > X-EOPAttributedMessage: 0
> > X-MS-Exchange-Organization-MessageDirectionality: Incoming
> > X-Forefront-Antispam-Report:
> CIP:212.185.86.227;CTRY:DE;IPV:NLI;EFV:NLI;SFV:NSPM;SFS:(979002)(6009001)(428001)(50944004)(51704005)(24454002)(199002)(189002)(16796002)(46386002)(74502001)(70486001)(50466002)(87836001)(46102001)(90896003)(77982001)(52956003)(54356999)(50986999)(76176999)(53806999)(81542001)(81342001)(74662001)(45336002)(43066001)(83072002)(85852003)(6806004)(19580395003)(83322001)(42186004)(44976005)(80976001)(19580405001)(76482001)(99396002)(4396001)(80022001)(33656001)(20776003)(47776003)(48376002)(2201001)(79102001)(92726001)(42882001)(90966001)(969003)(989001)(999001)(1009001)(1019001);DIR:INB;SFP:;SCL:1;SRVR:DBXPR07MB317;H:e002.nagler-company.com;FPR:BF74F31D.9C06D725.B1F32CB3.4CE95053.203B9;PTR:e002.nagler-company.com;A:1;MX:1;LANG:en;
> > Content-Type: text/plain
> > X-MS-Exchange-Organization-Network-Message-Id:
> ae18e11e-2fa8-44eb-8125-08d123b63ff9
> > X-MS-Exchange-Organization-AVStamp-Service: 1.0
> > Received-SPF: None (: NetBSD.org does not designate permitted sender
> hosts)
> > X-MS-Exchange-Organization-SCL: 1
> > X-MS-Exchange-Organization-AuthSource: DB3FFO11FD011.protection.gbl
> > X-MS-Exchange-Organization-AuthAs: Anonymous
> > MIME-Version: 1.0
> >
> > The following reply was made to PR kern/48733; it has been noted by GNATS.
> >
> > From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> > To: gnats-bugs%NetBSD.org@localhost
> > Cc: kern-bug-people%NetBSD.org@localhost,
> gnats-admin%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost
> > Subject: Re: kern/48733: deadlock in if_output() with interrupt on
> KERNEL_LOCK
> > Date: Fri, 11 Apr 2014 17:59:09 +0200
> >
> > On Fri, Apr 11, 2014 at 02:40:00PM +0000,
> Wolfgang.Stukenbrock%nagler-company.com@localhost wrote:
> > > >Description:
> > > Problem located in /src/sys/netinet/ip_output.c.
> > > Since file revision 1.208 the Kernel-Lock is locked prior
> calling if_output
> > > on the interface.
> > > Now - at least the wm-driver - will call splnet() and splx()
> inside the output
> > > routine.
> > > If any interrupt occurs in between splnet() and splx(), the
> interrupt is delayed and
> > > is processes in splx() when the level is released again.
> > > If such an interrupt is e.g. not MP-SAFE, the call stup in
> intr_biglock_wrapper() is
> > > used to call the interrupt routine and that one will lock the
> KERNEL-LOCK again.
> > > So we try to lock it again here -> dead-lock.
> > >
> > > Our system runs fine with 4 8257x interfaces, but after adding
> 2 additional 8254x
> > > interfaces, the system lock-up after a short time. Don't ask
> me, why the if_output
> > > call takes "to long" with theese two additonal interfaces, but
> it is reproducable.
> > > I've analysed this several times with DDB. Most times I've seen
> an USB-interrupt
> > > that dead-lock the system.
> >
> > I think your analsys is wrong. the KERNEL_LOCK is special in the sense
> that
> > it can be locked multiple time on the same CPU. So it's not a problem
> > that splx() on the same CPU tries to get KERNEL_LOCK again, it will just
> > increase the lock count. A splx() on another CPU will wait for the
> > KERNEL_LOCK to be relased.
> >
> > I think your problem is more likely in the USB stack.
> > Maybe one of your new ethernet interface shares an interrupt with the
> > USB controller ?
> >
> >
> > > >How-To-Repeat:
> > > Run a lot of trafic over wm-interfaces and do shomething e.g.
> on USB at the same
> > > time. It is just a question of time till system-dead-lock.
> > > >Fix:
> > > Fist guess: revert change done from 1.207 to 1.208.
> > > But I've no idea about side effects.
> >
> > Very bad: the output queues are protected by the KERNEL_LOCK and
> splnet().
> > If you revert ip_output 1.208, you'll also have to revert ip_input.c
> > 1.286 and 1.285, so that the whole IP stack runs under the KERNEL_LOCK
> again.
> >
> > --
> > Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> > NetBSD: 26 ans d'experience feront toujours la difference
> > --
> >
> >
>
>
> --
>
>
> Dr. Nagler & Company GmbH
> Hauptstraße 9
> 92253 Schnaittenbach
>
> Tel. +49 9622/71 97-42
> Fax +49 9622/71 97-50
>
> Wolfgang.Stukenbrock%nagler-company.com@localhost
> http://www.nagler-company.com
>
>
> Hauptsitz: Schnaittenbach
> Handelregister: Amberg HRB
> Gerichtsstand: Amberg
> Steuernummer: 201/118/51825
> USt.-ID-Nummer: DE 273143997
> Geschäftsführer: Dr. Martin Nagler
>
>
>
--
Dr. Nagler & Company GmbH
Hauptstraße 9
92253 Schnaittenbach
Tel. +49 9622/71 97-42
Fax +49 9622/71 97-50
Wolfgang.Stukenbrock%nagler-company.com@localhost
http://www.nagler-company.com
Hauptsitz: Schnaittenbach
Handelregister: Amberg HRB
Gerichtsstand: Amberg
Steuernummer: 201/118/51825
USt.-ID-Nummer: DE 273143997
Geschäftsführer: Dr. Martin Nagler
Home |
Main Index |
Thread Index |
Old Index