The following reply was made to PR kern/48733; it has been noted by GNATS.
From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Date: Sun, 27 Apr 2014 10:21:03 +0200
Hi, sorry for the delay, but I'm working off-site most of the time at the moment.
I have to look at the system to validate the assumption that the USB
controller shares the interrupt. (I remember that some interrupts are shared.)
But I cannot do this before 02.05.2014 at the earliest ...
And since the problem shows up when the spl level is lowered, the interrupt
does not have to be shared with the wm driver to run into it.
The USB interrupt is not the only one I've seen.
And I've analysed the stack backtrace of the dead-locked system,
and it locks up in the interrupt code while trying to get the KERNEL_LOCK!
I will have a look at this again, but as far as I remember it definitely
locks up in splx() while trying to get the KERNEL_LOCK. splx() does not
return anymore, and I identified the interrupt source by translating the jump
address into a symbol; that handler was never reached.
(I have not yet commented out the LOCK/UNLOCK before the call to the
output routine to validate my analysis. No time so far, sorry, but I
will do it as soon as possible.)
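For reference while re-checking the backtrace, the recursive behaviour that
Manuel describes in the quoted reply below can be modelled in user space
roughly as follows. This is only a sketch under stated assumptions: struct
biglock, biglock_try_enter() and biglock_exit() are made-up names, not the
kernel's API, and the model is single-threaded with no atomics or spinning.

```c
#include <assert.h>

/* Hypothetical user-space model of a per-CPU recursive "big lock".
 * Not the NetBSD implementation; all names here are illustrative. */
#define NO_OWNER (-1)

struct biglock {
    int owner;      /* index of the CPU holding the lock, or NO_OWNER */
    unsigned count; /* recursion depth on the owning CPU */
};

/* Returns 1 if the lock was taken (or re-taken by its owner),
 * 0 if the calling CPU would have to wait for the owner. */
static int biglock_try_enter(struct biglock *l, int cpu)
{
    if (l->owner == NO_OWNER) {
        l->owner = cpu;
        l->count = 1;
        return 1;
    }
    if (l->owner == cpu) {
        l->count++;     /* same CPU: only the depth grows */
        return 1;
    }
    return 0;           /* another CPU holds it */
}

/* Drops one level; the lock becomes free when the depth reaches zero. */
static void biglock_exit(struct biglock *l, int cpu)
{
    assert(l->owner == cpu && l->count > 0);
    if (--l->count == 0)
        l->owner = NO_OWNER;
}
```

In this model, a second biglock_try_enter() from the CPU that already holds
the lock (the deferred interrupt running from splx()) succeeds and raises the
depth to 2, while a call from a different CPU returns 0 until the owner has
called biglock_exit() as often as it entered.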
I will send the results of my review as soon as I am able to do it.
best regards
W. Stukenbrock
Manuel Bouyer wrote:
> The following reply was made to PR kern/48733; it has been noted by GNATS.
>
> From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> To: gnats-bugs%NetBSD.org@localhost
> Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost,
>     netbsd-bugs%NetBSD.org@localhost
> Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
> Date: Fri, 11 Apr 2014 17:59:09 +0200
>
> On Fri, Apr 11, 2014 at 02:40:00PM +0000, Wolfgang.Stukenbrock%nagler-company.com@localhost wrote:
> > >Description:
> > Problem located in /src/sys/netinet/ip_output.c.
> > Since file revision 1.208 the Kernel-Lock is locked prior to calling if_output
> > on the interface.
> > Now - at least the wm-driver - will call splnet() and splx() inside the output
> > routine.
> > If any interrupt occurs in between splnet() and splx(), the interrupt is delayed and
> > is processed in splx() when the level is released again.
> > If such an interrupt is e.g. not MP-SAFE, the call stub in intr_biglock_wrapper() is
> > used to call the interrupt routine and that one will lock the KERNEL-LOCK again.
> > So we try to lock it again here -> dead-lock.
> >
> > Our system runs fine with 4 8257x interfaces, but after adding 2 additional 8254x
> > interfaces, the system locks up after a short time. Don't ask me why the if_output
> > call takes "too long" with these two additional interfaces, but it is reproducible.
> > I've analysed this several times with DDB. Most times I've seen a USB-interrupt
> > that dead-locks the system.
>
> I think your analysis is wrong. The KERNEL_LOCK is special in the sense that
> it can be locked multiple times on the same CPU. So it's not a problem
> that splx() on the same CPU tries to get KERNEL_LOCK again, it will just
> increase the lock count. A splx() on another CPU will wait for the
> KERNEL_LOCK to be released.
>
> I think your problem is more likely in the USB stack.
> Maybe one of your new ethernet interfaces shares an interrupt with the
> USB controller ?
>
>
> > >How-To-Repeat:
> > Run a lot of traffic over wm-interfaces and do something e.g. on USB at the same
> > time. It is just a question of time till system-dead-lock.
> > >Fix:
> > First guess: revert change done from 1.207 to 1.208.
> > But I've no idea about side effects.
>
> Very bad: the output queues are protected by the KERNEL_LOCK and splnet().
> If you revert ip_output 1.208, you'll also have to revert ip_input.c
> 1.286 and 1.285, so that the whole IP stack runs under the KERNEL_LOCK again.
>
> --
> Manuel Bouyer <bouyer%antioche.eu.org@localhost>
> NetBSD: 26 ans d'experience feront toujours la difference
> --
>
>
>
--
Dr. Nagler & Company GmbH
Hauptstraße 9
92253 Schnaittenbach
Tel. +49 9622/71 97-42
Fax +49 9622/71 97-50
Wolfgang.Stukenbrock%nagler-company.com@localhost
http://www.nagler-company.com
Head office: Schnaittenbach
Commercial register: Amberg HRB
Place of jurisdiction: Amberg
Tax number: 201/118/51825
VAT ID number: DE 273143997
Managing director: Dr. Martin Nagler