NetBSD-Bugs archive


Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK



The following reply was made to PR kern/48733; it has been noted by GNATS.

From: Wolfgang Stukenbrock <wolfgang.stukenbrock%nagler-company.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, 
netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
Date: Sun, 27 Apr 2014 10:21:03 +0200

 Hi, sorry for the delay, but I'm working off-site most of the time at the moment.
 
 I have to look at the system to validate the assumption that the USB 
 controller shares the interrupt. (I remember that some interrupts are shared.)
 But I cannot do this before 02.05.2014 ...
 And lowering the spl level does not imply that the interrupt has to be 
 shared with the wm driver to run into this problem.
 
 The USB interrupt is not the only one I've seen.
 And I've analysed the stack backtrace of the deadlocked system, 
 and it locks up in the interrupt code while trying to get the KERNEL_LOCK!
 I will have a look at this again, but as far as I remember it definitely 
 locks up in splx() while trying to get the KERNEL_LOCK. splx() never 
 returns, and I identified the interrupt source by translating the jump 
 address to a symbol; that handler was never reached.
 (I have still not commented out the LOCK/UNLOCK before the call to the 
 output routine to validate my analysis - no time until now, sorry - but I 
 will do it as soon as possible.)
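
The deferral described in this PR (an interrupt arriving between splnet() and splx() is held back and only processed when splx() lowers the level again) can be modelled in user space. This is only a sketch under assumptions, not NetBSD code: `struct cpu_model`, `spl_raise()`, `spl_restore()`, `post_interrupt()` and the numeric `LEVEL_*` values are hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical priority levels; the real IPL values differ per port. */
#define LEVEL_NONE 0
#define LEVEL_DEV  4   /* some device interrupt, e.g. a USB controller */
#define LEVEL_NET  5   /* level a network driver raises to */

struct cpu_model {
    int      ipl;        /* current interrupt priority level */
    unsigned pending;    /* bit n set: an interrupt at level n is waiting */
    int      handled;    /* number of handlers that have run */
    int      last_level; /* level of the most recently run handler */
};

/* Stand-in for a real interrupt handler: just record that it ran. */
static void run_handler(struct cpu_model *c, int level)
{
    c->handled++;
    c->last_level = level;
}

/* Model of splnet()-style raising: returns the previous level. */
static int spl_raise(struct cpu_model *c, int level)
{
    int old = c->ipl;
    if (level > c->ipl)
        c->ipl = level;
    return old;
}

/* An interrupt at or below the current level is deferred, not run. */
static void post_interrupt(struct cpu_model *c, int level)
{
    if (level > c->ipl)
        run_handler(c, level);
    else
        c->pending |= 1u << level;
}

/* Model of splx(): lower the level, then run whatever was deferred. */
static void spl_restore(struct cpu_model *c, int old)
{
    c->ipl = old;
    for (int lvl = 31; lvl > old; lvl--) {
        if (c->pending & (1u << lvl)) {
            c->pending &= ~(1u << lvl);
            run_handler(c, lvl);
        }
    }
}
```

In this model a device interrupt posted while the level is raised to `LEVEL_NET` does nothing until `spl_restore()` is called, so the handler runs inside the splx()-like call. That is the place where the backtrace shows the system stuck.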
 
 I will send the results of my review as soon as I am able to do it.
 
 best regards
 
 W. Stukenbrock
 
 Manuel Bouyer wrote:
 
 > The following reply was made to PR kern/48733; it has been noted by GNATS.
 > 
 > From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
 > To: gnats-bugs%NetBSD.org@localhost
 > Cc: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, 
 > netbsd-bugs%NetBSD.org@localhost
 > Subject: Re: kern/48733: deadlock in if_output() with interrupt on KERNEL_LOCK
 > Date: Fri, 11 Apr 2014 17:59:09 +0200
 > 
 >  On Fri, Apr 11, 2014 at 02:40:00PM +0000, Wolfgang.Stukenbrock%nagler-company.com@localhost wrote:
 >  > >Description:
 >  >   Problem located in /src/sys/netinet/ip_output.c.
 >  >   Since file revision 1.208 the Kernel-Lock is locked prior to calling
 >  >   if_output on the interface.
 >  >   Now - at least the wm driver - will call splnet() and splx() inside the
 >  >   output routine.
 >  >   If any interrupt occurs between splnet() and splx(), the interrupt is
 >  >   delayed and is processed in splx() when the level is lowered again.
 >  >   If such an interrupt is e.g. not MP-SAFE, the call stub in
 >  >   intr_biglock_wrapper() is used to call the interrupt routine, and that
 >  >   one will lock the KERNEL-LOCK again.
 >  >   So we try to lock it again here -> deadlock.
 >  > 
 >  >   Our system runs fine with 4 8257x interfaces, but after adding 2
 >  >   additional 8254x interfaces, the system locks up after a short time.
 >  >   Don't ask me why the if_output call takes "too long" with these two
 >  >   additional interfaces, but it is reproducible.
 >  >   I've analysed this several times with DDB. Most times I've seen a
 >  >   USB interrupt that deadlocks the system.
 >  
 >  I think your analysis is wrong. The KERNEL_LOCK is special in the sense that
 >  it can be locked multiple times on the same CPU. So it's not a problem
 >  that splx() on the same CPU tries to get KERNEL_LOCK again, it will just
 >  increase the lock count. A splx() on another CPU will wait for the
 >  KERNEL_LOCK to be released.
 >  
 >  I think your problem is more likely in the USB stack.
 >  Maybe one of your new ethernet interfaces shares an interrupt with the
 >  USB controller?
 >  
 >  
 >  > >How-To-Repeat:
 >  >   Run a lot of traffic over wm interfaces and do something e.g. on USB at
 >  >   the same time. It is just a question of time until the system deadlocks.
 >  > >Fix:
 >  >   First guess: revert the change done from 1.207 to 1.208.
 >  >   But I've no idea about side effects.
 >  
 >  Very bad: the output queues are protected by the KERNEL_LOCK and splnet().
 >  If you revert ip_output 1.208, you'll also have to revert ip_input.c
 >  1.286 and 1.285, so that the whole IP stack runs under the KERNEL_LOCK
 >  again.
 >  
 >  -- 
 >  Manuel Bouyer <bouyer%antioche.eu.org@localhost>
 >       NetBSD: 26 ans d'experience feront toujours la difference
 >  --
 >  
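
The point made in the quoted reply, that a second KERNEL_LOCK acquisition on the same CPU only increases a hold count while another CPU has to wait, can be illustrated with a small user-space model. This is a sketch under assumptions and not the NetBSD implementation: `struct biglock`, `biglock_try_enter()` and `biglock_exit()` are hypothetical names, the model is single-threaded, and a failed attempt simply returns false where a real kernel would spin until the lock is free.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Toy model of a recursive "big lock": one owning CPU plus a hold count. */
struct biglock {
    int owner;   /* id of the CPU holding the lock, or NO_OWNER */
    int count;   /* how many times the owner has taken it */
};

/*
 * Try to take the lock on behalf of `cpu`.
 * Returns true on success; false means another CPU holds the lock
 * (a real kernel would wait here instead of returning).
 */
static bool biglock_try_enter(struct biglock *l, int cpu)
{
    if (l->owner == NO_OWNER) {
        l->owner = cpu;
        l->count = 1;
        return true;
    }
    if (l->owner == cpu) {
        l->count++;          /* same CPU again: only bump the count */
        return true;
    }
    return false;            /* held by a different CPU */
}

/* Drop one hold; the lock becomes free when the count reaches zero. */
static void biglock_exit(struct biglock *l, int cpu)
{
    assert(l->owner == cpu && l->count > 0);
    if (--l->count == 0)
        l->owner = NO_OWNER;
}
```

With this model, an interrupt wrapper that takes the lock again on the CPU already holding it succeeds immediately and leaves the count at 2. Only an attempt from a different CPU is refused until both holds have been released.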
 
 
 -- 
 
 
 Dr. Nagler & Company GmbH
 Hauptstraße 9
 92253 Schnaittenbach
 
 Tel. +49 9622/71 97-42
 Fax +49 9622/71 97-50
 
 Wolfgang.Stukenbrock%nagler-company.com@localhost
 http://www.nagler-company.com
 
 
 Hauptsitz: Schnaittenbach
 Handelregister: Amberg HRB
 Gerichtsstand: Amberg
 Steuernummer: 201/118/51825
 USt.-ID-Nummer: DE 273143997
 Geschäftsführer: Dr. Martin Nagler
 
 

