Manuel Bouyer <bouyer%antioche.eu.org@localhost> writes:

> On Tue, Mar 06, 2012 at 11:56:45AM -0500, Beverly Schwartz wrote:
>> ddb backtrace produces:
>> vpanic
>> kern_assert
>> bnx_start
>> bnx_alloc_pkts
>> workqueue_worker
>
> thanks, so the problem is really the workqueue that should not
> be marked MPSAFE ...

Restating some off-list discussion for the record, now that we've
figured it out:

  bnx_start can defer work to allocate tx data structures via a
  workqueue

  the workqueue registration is marked MPSAFE

  so when the workqueue calls the alloc routines, the kernel lock is
  not held

  the alloc routine calls bnx_start, and it protects that with splnet,
  but it hasn't taken the kernel lock

  so bnx_start (the second time on the first packet) is running at
  splnet, without the kernel lock.  This triggers the assert.

  if the assert isn't there, then there's the possibility of another
  processor handling an interrupt and calling bnx_start.  Both the
  workqueue-called copy and the intr-called copy will be at splnet,
  but on different processors.

The above is typically rare, and it seems to take heavy load to
trigger it.  It's probably the combination of multiple TCPs opening
up cwnd and the CPU utilization getting high that leads to the
unintended concurrency.

The proposed fix is to not mark bnx's workqueue MPSAFE (instead of the
patch I sent earlier).
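
For readers following along, here is a toy userspace model of the
sequence described above.  It is only a sketch, not driver or kernel
code: every name in it (MODEL_WQ_MPSAFE, kernel_lock_held, the model_*
functions) is a hypothetical stand-in, and it assumes that a worker for
a non-MPSAFE workqueue holds the kernel lock around the work function,
which is how the discussion above describes it.  It models the missing
lock only, not the second processor.

```c
/*
 * Userspace model of the failure, NOT NetBSD kernel code.
 * All identifiers are hypothetical stand-ins.
 */
#include <assert.h>
#include <stdbool.h>

#define MODEL_WQ_MPSAFE	0x01	/* stand-in for the MPSAFE workqueue flag */

static bool kernel_lock_held;	/* models the kernel lock */
static int assert_failures;	/* counts would-be kernel assertion failures */

/* Models bnx_start: it expects the caller to hold the kernel lock. */
static void
model_bnx_start(void)
{
	if (!kernel_lock_held)
		assert_failures++;	/* the real kernel would panic here */
}

/* Models the deferred alloc routine, which calls back into start. */
static void
model_bnx_alloc_pkts(void)
{
	/* splnet would be raised here; that does not take the kernel lock. */
	model_bnx_start();
}

/* Models the worker: takes the kernel lock unless the queue is MPSAFE. */
static void
model_workqueue_worker(int flags)
{
	bool take = (flags & MODEL_WQ_MPSAFE) == 0;

	if (take)
		kernel_lock_held = true;
	model_bnx_alloc_pkts();
	if (take)
		kernel_lock_held = false;
}

/* Runs one deferred work item; returns how many assertions it tripped. */
static int
model_run(int flags)
{
	assert_failures = 0;
	kernel_lock_held = false;
	model_workqueue_worker(flags);
	assert(!kernel_lock_held);	/* worker must not leak the lock */
	return assert_failures;
}
```

In this model, model_run(MODEL_WQ_MPSAFE) returns 1 (the assert in
start fires), and model_run(0) returns 0, which corresponds to the
proposed fix of not marking the workqueue MPSAFE.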