locking/synchronization changes 4.99.66->now? (broken opencrypto)

To: tech-kern%netbsd.org@localhost, port-i386%netbsd.org@localhost, tech-crypto%netbsd.org@localhost
Subject: locking/synchronization changes 4.99.66->now? (broken opencrypto)
From: Thor Lancelot Simon <tls%rek.tjls.com@localhost>
Date: Thu, 6 Nov 2008 11:00:17 -0500

Darran Hunt and I have been trying to explain some extremely strange
crashes in opencrypto with 5.0_BETA or recent -current.  With a
DEBUG DIAGNOSTIC LOCKDEBUG kernel, the following test (using the software
backend):

        sysctl -w kern.cryptodevallowsoft=-1
        openssl speed -engine cryptodev -elapsed -evp des-ede3-cbc

will pretty reliably produce a panic on our 4- or 8-CPU amd64 and i386
machines.  The panic is a protection fault when the cv_signal() in
cryptoret() fires.  Further investigation shows that the condvar in
the active request appears to have been freed, and working around that
(by use of cv_is_valid()) lets us get far enough to be reasonably sure that
the containing cryptoreq (the "crp") has been freed but is somehow
still on cryptoret()'s return queue.  The only other possibility we see (but
think we have mostly ruled out) is a pool allocator problem leading to
overlapping request structures.
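
To make the invariant we expect concrete, here is a minimal userspace
model of it.  This is a sketch using POSIX threads and <sys/queue.h>, not
the opencrypto code: struct req, ret_q, q_mtx, req_enqueue(),
req_return_one() and req_wait_and_free() are all hypothetical names.  The
only point is that a request is unlinked and its condvar signalled under
one mutex, and that it is destroyed only after it is off the queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a crypto request; not the kernel structure. */
struct req {
    TAILQ_ENTRY(req) link;
    pthread_cond_t cv;
    bool on_queue;              /* true while linked on ret_q */
    bool done;                  /* set by the return path before signalling */
};

static TAILQ_HEAD(, req) ret_q = TAILQ_HEAD_INITIALIZER(ret_q);
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;

struct req *req_alloc(void)
{
    struct req *r = calloc(1, sizeof(*r));
    if (r != NULL)
        pthread_cond_init(&r->cv, NULL);
    return r;
}

/* Completion side: link a finished request onto the return queue. */
void req_enqueue(struct req *r)
{
    pthread_mutex_lock(&q_mtx);
    TAILQ_INSERT_TAIL(&ret_q, r, link);
    r->on_queue = true;
    pthread_mutex_unlock(&q_mtx);
}

/*
 * Return-thread side: unlink the head, mark it done and signal its
 * waiter, all under q_mtx.  Returns true if a request was handled.
 * The request pointer is deliberately not returned: once the lock is
 * dropped the waiter is free to destroy it.
 */
bool req_return_one(void)
{
    bool handled = false;

    pthread_mutex_lock(&q_mtx);
    struct req *r = TAILQ_FIRST(&ret_q);
    if (r != NULL) {
        TAILQ_REMOVE(&ret_q, r, link);
        r->on_queue = false;
        r->done = true;
        pthread_cond_signal(&r->cv);
        handled = true;
    }
    pthread_mutex_unlock(&q_mtx);
    return handled;
}

/* Requester side: wait for completion, then destroy the request. */
void req_wait_and_free(struct req *r)
{
    pthread_mutex_lock(&q_mtx);
    while (!r->done)
        pthread_cond_wait(&r->cv, &q_mtx);
    assert(!r->on_queue);       /* freeing a linked request is the bug */
    pthread_mutex_unlock(&q_mtx);

    pthread_cond_destroy(&r->cv);
    free(r);
}
```

If the mutex really excludes, the assertion in req_wait_and_free() can
never fire in this model; what we observe looks like the equivalent of it
firing.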

Another very odd thing here is that of course the test above produces a
single stream of basically synchronous requests.  There should be little
or no concurrency to create potential problems.

We've been over the code with a fine-tooth comb and are pretty sure there
should be no way for the return queue to be manipulated or the code that
frees requests to be called except under the protection of crypto_mtx.
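
One way to check that claim mechanically rather than by reading is to
assert lock ownership at every queue operation; in the kernel that would
be something like KASSERT(mutex_owned(&crypto_mtx)).  Below is a
userspace sketch of the same idea, assuming a single lock and using a
thread-local flag to record ownership.  The names q_enter(), q_exit(),
q_owned(), q_push() and q_length() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread flag: does the calling thread currently hold q_mtx? */
static _Thread_local bool q_mtx_held;

/* Number of items on the (imaginary) queue protected by q_mtx. */
static int q_len;

void q_enter(void)
{
    pthread_mutex_lock(&q_mtx);
    q_mtx_held = true;
}

void q_exit(void)
{
    assert(q_mtx_held);         /* catches unlock without a matching lock */
    q_mtx_held = false;
    pthread_mutex_unlock(&q_mtx);
}

bool q_owned(void)
{
    return q_mtx_held;
}

/* Every operation on the protected data asserts ownership first. */
void q_push(void)
{
    assert(q_owned());
    q_len++;
}

int q_length(void)
{
    assert(q_owned());
    return q_len;
}
```

Any path that touches the queue without taking the lock then trips an
assertion immediately instead of corrupting the queue silently.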

It is almost as if crypto_mtx weren't mutexing, specifically around the
TAILQ manipulation for the return queue -- it looks like the TAILQ calls
in cryptoret() get a stale TAILQ_HEAD that points at freed data.  We tried
putting membar_sync() before and after the traversal of the TAILQ in
cryptoret() and crypto_ret_q_remove() but this didn't help; I'm not sure
it should -- do these operations guarantee that *other* CPUs have all
pending loads/stores flushed?
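
My understanding, for what it is worth, is that a barrier of this kind
only constrains the order of the issuing CPU's own loads and stores; it
does not reach out and flush anything on other CPUs, so it only helps
when the other side executes a matching barrier (which mutex acquire and
release would normally supply).  A minimal sketch of that pairing in
userspace C11 atomics, not kernel code, with hypothetical names publish()
and consume():

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;             /* ordinary data being handed over */
static atomic_bool ready;       /* flag that publishes the payload */

/* Writer: store the data, then a release fence, then set the flag. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/*
 * Reader: load the flag, then an acquire fence, then read the data.
 * Without the reader's fence, the writer's fence alone guarantees
 * nothing about what this thread observes.
 */
bool consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

By the same reasoning, adding barriers only inside cryptoret() and
crypto_ret_q_remove() would not be expected to fix a path elsewhere that
touches the queue without the lock.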
 
What is most baffling is that if we run the exact 4.99.66 code in today's
kernel it breaks terribly in this way whereas in the 4.99.66 kernel it
works fine.  I can't see a change that should produce this.

The upshot is that opencrypto is totally broken on MP machines, we've
spent the best part of a week working on it and are really confused --
but what we're seeing concerns me because if there is not a bug we've
missed in opencrypto, much more could be broken.  Help?

Thor

Follow-Ups:
- Re: locking/synchronization changes 4.99.66->now? (broken opencrypto)
  - From: Andrew Doran
- Re: locking/synchronization changes 4.99.66->now? (broken opencrypto)
  - From: Bill Stouder-Studenmund
- Re: locking/synchronization changes 4.99.66->now? (broken opencrypto)
  - From: Thor Lancelot Simon
