NetBSD-Bugs archive


Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK



Hi, sorry for the delay, but I'm working off-site most of the time at the moment.

I have to look at the system to validate the assumption that USB shares the interrupt. (I remember that some interrupts are shared.)
But I cannot do this earlier than 02.05.2014.
And lowering the spl level does not imply that the interrupt has to be shared with the wm driver to run into this problem.

The USB interrupt is not the only one I've seen.
And I've analysed the stack backtrace of the deadlocked system, and it locks up in the interrupt code while trying to get the KERNEL_LOCK! I will have a look at this again, but as far as I remember it definitely locks up in splx() while trying to get the KERNEL_LOCK. splx() never returns, and I identified the interrupt source by translating the jump address, which was never reached, to a symbol. (I have not yet commented out the LOCK/UNLOCK before the call to the output routine to validate my analysis. No time so far, sorry, but I will do it as soon as possible.)

I will send the results of my review as soon as I am able to do it.

best regards

W. Stukenbrock

Manuel Bouyer wrote:

The following reply was made to PR kern/48733; it has been noted by GNATS.

From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Date: Fri, 11 Apr 2014 17:59:09 +0200

 On Fri, Apr 11, 2014 at 02:40:00PM +0000, Wolfgang.Stukenbrock%nagler-company.com@localhost wrote:
 > >Description:
 >   Problem located in /src/sys/netinet/ip_output.c.
 >   Since file revision 1.208 the KERNEL_LOCK is taken prior to calling
 >   if_output on the interface.
 >   Now - at least the wm driver - will call splnet() and splx() inside the
 >   output routine.
 >   If any interrupt occurs between splnet() and splx(), the interrupt is
 >   delayed and is processed in splx() when the level is lowered again.
 >   If such an interrupt is e.g. not MP-safe, the call stub in
 >   intr_biglock_wrapper() is used to call the interrupt routine, and that
 >   one will take the KERNEL_LOCK again.
 >   So we try to lock it again here -> deadlock.
 >
 >   Our system runs fine with 4 8257x interfaces, but after adding 2
 >   additional 8254x interfaces, the system locks up after a short time.
 >   Don't ask me why the if_output call takes "too long" with these two
 >   additional interfaces, but it is reproducible.
 >   I've analysed this several times with DDB. Most times I've seen a
 >   USB interrupt that deadlocks the system.

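 For readers following the report, the deferral it describes (an interrupt
 arriving between splnet() and splx() is held back and only runs once splx()
 lowers the level) can be sketched with a toy model. This is only an
 illustration under stated assumptions: the toy_* functions, the two-level
 IPL scheme and the single pending flag are hypothetical and are not
 NetBSD's real implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-level priority scheme, for illustration only. */
#define TOY_IPL_NONE 0
#define TOY_IPL_NET  1

static int  toy_cur_ipl = TOY_IPL_NONE; /* current priority level          */
static bool toy_pending = false;        /* interrupt arrived while blocked */
static int  toy_handled = 0;            /* number of handler invocations   */

/* Stand-in for a device interrupt handler. */
static void toy_handler(void)
{
    toy_handled++;
}

/* Raise the level to block network interrupts; returns the old level. */
static int toy_splnet(void)
{
    int old = toy_cur_ipl;

    if (toy_cur_ipl < TOY_IPL_NET)
        toy_cur_ipl = TOY_IPL_NET;
    return old;
}

/* An interrupt fires: run it now if allowed, otherwise mark it pending. */
static void toy_raise_interrupt(void)
{
    if (toy_cur_ipl >= TOY_IPL_NET)
        toy_pending = true;
    else
        toy_handler();
}

/* Restore the old level and run a deferred interrupt, if any. */
static void toy_splx(int old)
{
    toy_cur_ipl = old;
    if (toy_pending && toy_cur_ipl < TOY_IPL_NET) {
        toy_pending = false;
        toy_handler();
    }
}
```

 In this model, an interrupt raised after toy_splnet() does not run its
 handler until toy_splx() is called, so the handler executes in the context
 of the code that lowered the level.
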
 I think your analysis is wrong. The KERNEL_LOCK is special in the sense that
 it can be locked multiple times on the same CPU. So it's not a problem
 that splx() on the same CPU tries to get KERNEL_LOCK again; it will just
 increase the lock count. A splx() on another CPU will wait for the
 KERNEL_LOCK to be released.
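
 The recursion argument can be illustrated with a small model of a lock
 that tracks an owning CPU and a hold count. This is a minimal sketch, not
 the kernel's code: struct toy_biglock and the toy_biglock_* functions are
 hypothetical names, and where a real kernel would spin waiting for the
 lock, the sketch simply reports failure.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_OWNER (-1)

/* Hypothetical recursive "big lock": one owner CPU plus a hold count. */
struct toy_biglock {
    int owner; /* index of the CPU holding the lock, or TOY_NO_OWNER */
    int count; /* how many times the owning CPU has taken the lock   */
};

/*
 * Try to take the lock on behalf of `cpu`.
 * Returns true on success. A real kernel would spin instead of
 * returning false when another CPU holds the lock.
 */
static bool toy_biglock_try(struct toy_biglock *l, int cpu)
{
    if (l->owner == TOY_NO_OWNER) {
        l->owner = cpu;
        l->count = 1;
        return true;
    }
    if (l->owner == cpu) {
        l->count++; /* same CPU: only the hold count increases */
        return true;
    }
    return false;   /* held by another CPU: caller would have to wait */
}

/* Drop one hold; the lock becomes free when the count reaches zero. */
static void toy_biglock_release(struct toy_biglock *l, int cpu)
{
    assert(l->owner == cpu && l->count > 0);
    if (--l->count == 0)
        l->owner = TOY_NO_OWNER;
}
```

 With this model, a second acquisition from the owning CPU (as from an
 interrupt handler run inside splx()) succeeds and raises the count to 2,
 while an attempt from a different CPU fails until both holds are released.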

 I think your problem is more likely in the USB stack.
 Maybe one of your new Ethernet interfaces shares an interrupt with the
 USB controller?

 > >How-To-Repeat:
 >   Run a lot of traffic over wm interfaces and do something e.g. on USB at
 >   the same time. It is just a question of time until the system deadlocks.
 > >Fix:
 >   First guess: revert the change made from 1.207 to 1.208.
 >   But I've no idea about side effects.

 Very bad: the output queues are protected by the KERNEL_LOCK and splnet().
 If you revert ip_output.c 1.208, you'll also have to revert ip_input.c
 1.286 and 1.285, so that the whole IP stack runs under the KERNEL_LOCK again.

 --
 Manuel Bouyer <bouyer%antioche.eu.org@localhost>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

--


Dr. Nagler & Company GmbH
Hauptstraße 9
92253 Schnaittenbach

Tel. +49 9622/71 97-42
Fax +49 9622/71 97-50

Wolfgang.Stukenbrock%nagler-company.com@localhost
http://www.nagler-company.com


Hauptsitz: Schnaittenbach
Handelregister: Amberg HRB
Gerichtsstand: Amberg
Steuernummer: 201/118/51825
USt.-ID-Nummer: DE 273143997
Geschäftsführer: Dr. Martin Nagler



