tech-net archive


Re: apparently missing locking in if_bnx.c



Manuel Bouyer <bouyer%antioche.eu.org@localhost> writes:

> On Tue, Mar 06, 2012 at 11:56:45AM -0500, Beverly Schwartz wrote:
>> ddb backtrace produces:
>> vpanic
>> kern_assert
>> bnx_start
>> bnx_alloc_pkts
>> workqueue_worker
>
> thanks, so the problem is really the workqueue that should not
> be marked MPSAFE ...

Restating some off-list discussion for the record, now that we've
figured it out:

  bnx_start can defer work to allocate tx data structures via a
  workqueue

  the workqueue registration is marked MPSAFE

  so when the workqueue calls the alloc routines, the kernel lock is not
  held

  the alloc routine calls bnx_start, and it protects that with splnet,
  but it hasn't taken the kernel lock

  so bnx_start (the second time on the first packet) is running at
  splnet, without the kernel lock.  This triggers the assert.

  if the assert isn't there, then there's the possibility of another
  processor handling an interrupt and calling bnx_start.  Both the
  workqueue-called copy and the intr-called copy will be at splnet, but
  on different processors.

  The above is typically rare, and it seems to take heavy load to
  trigger it.  It's probably the combination of multiple TCP
  connections opening up cwnd and CPU utilization getting high that
  leads to the unintended concurrency.

  The proposed fix is to not mark bnx's workqueue MPSAFE (instead of the
  patch I sent earlier).
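
To make the sequence above concrete, here is a minimal user-space sketch of
the locking relationship.  It is not the driver code and not the kernel's
workqueue implementation: every sim_* name and SIM_WQ_MPSAFE are
hypothetical stand-ins, the "kernel lock" is a plain flag, and the model
assumes that a worker for a queue not marked MPSAFE runs with the kernel
lock held, as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the MPSAFE flag on a workqueue. */
#define SIM_WQ_MPSAFE 0x01

/* Models "the current thread holds the kernel lock". */
static bool sim_kernel_lock_held;

/* Counts start calls that ran without the kernel lock. */
static int sim_unlocked_starts;

/* Models bnx_start: it expects the kernel lock to be held.
 * The real driver asserts here; the sketch only counts the violation. */
static void sim_start(void)
{
    if (!sim_kernel_lock_held)
        sim_unlocked_starts++;
}

/* Models the deferred alloc routine, which calls back into start.
 * The real routine raises to splnet first; that only blocks
 * interrupts on the local CPU and takes no lock, so it is omitted. */
static void sim_alloc_pkts(void)
{
    sim_start();
}

/* Models the worker: for a queue NOT marked MPSAFE the kernel lock
 * is held around the work function (assumption of this sketch).
 * Returns the number of start calls made without the lock. */
static int sim_run_worker(int flags)
{
    bool take_lock = !(flags & SIM_WQ_MPSAFE);

    sim_unlocked_starts = 0;
    if (take_lock)
        sim_kernel_lock_held = true;
    sim_alloc_pkts();
    if (take_lock)
        sim_kernel_lock_held = false;
    return sim_unlocked_starts;
}
```

Calling sim_run_worker(SIM_WQ_MPSAFE) reports one unlocked call into
sim_start, which corresponds to the assert firing; sim_run_worker(0)
reports none, which is the effect the proposed fix relies on.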




