For others trying to repeat this kind of stress test: note that we've found that actually triggering a problem seems to be very dependent on all sorts of things that shouldn't matter, e.g. i386 vs amd64, firmware revisions, etc. But that may be about bugs in private code; we don't have enough experience to make this statement about the workqueue/MPSAFE bug.

We were reliably able to induce a lockup with:

  netbsd-6 from yesterday
  2 machines
  3 bnx each, cabled back-to-back in pairs
  each machine runs a web server, with a ~10G+ file
  each machine runs 3 wget, pulling per interface from the other machine

so this is 6 tcp streams, one per direction on each of 3 pairs of interfaces.

With the workqueue/remove-MPSAFE patch, the machines are totally solid under this load. With the mutex patch I posted earlier, they were almost solid, but not quite (probably because access to the tx dma setup hardware was not serialized).

Further, with the patch and LOCKDEBUG, the systems run without crashing/panicking, but about 40x slower. Without the patch and with LOCKDEBUG, there were mysterious hangs.

I would expect that on most machines, it wouldn't be possible to provoke the bug with only one interface.

My understanding is that the above stress test with 3 pairs of wm (yesterday or today netbsd-6) also leads to hangs. (wm doesn't use workqueues, so it must be something else. But wm quad-port cards seem to have funky bridge chips that netbsd-5 at least doesn't handle.)
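For anyone setting up the bnx test above, here is a rough sketch of the wget side, written in Python only to make the stream layout explicit. The addressing scheme (10.0.N.1 on one machine, 10.0.N.2 on the other), the file name bigfile.bin, and the helper name wget_commands are all made up for illustration; substitute whatever your web server and cabling actually use. It only prints the command lines, it does not start anything.

```python
import shlex

def wget_commands(peer_addrs, filename="bigfile.bin"):
    """Return one wget argument list per peer address (hypothetical helper)."""
    cmds = []
    for addr in peer_addrs:
        url = f"http://{addr}/{filename}"
        # -q keeps wget quiet; -O /dev/null discards the payload so the
        # local disk is not part of the test.
        cmds.append(["wget", "-q", "-O", "/dev/null", url])
    return cmds

# Assumed addressing: interface pair N lives on 10.0.N.0/24,
# machine A is .1 and machine B is .2 on each pair.
PEERS_SEEN_FROM_A = [f"10.0.{n}.2" for n in range(3)]
PEERS_SEEN_FROM_B = [f"10.0.{n}.1" for n in range(3)]

if __name__ == "__main__":
    # Run on machine A: three pulls, one per back-to-back link.
    for cmd in wget_commands(PEERS_SEEN_FROM_A):
        print(shlex.join(cmd))
```

Machine B would do the same with PEERS_SEEN_FROM_B, which gives the 6 streams in total. Each printed command needs to be started in the background so that the three pulls run concurrently.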