Re: locking/synchronization changes 4.99.66->now? (broken opencrypto)

To: Thor Lancelot Simon <tls%rek.tjls.com@localhost>
Subject: Re: locking/synchronization changes 4.99.66->now? (broken opencrypto)
From: Andrew Doran <ad%netbsd.org@localhost>
Date: Fri, 21 Nov 2008 22:04:41 +0000

On Thu, Nov 06, 2008 at 11:00:17AM -0500, Thor Lancelot Simon wrote:

>       sysctl -w kern.cryptodevallowsoft=-1
>       openssl speed -engine cryptodev -elapsed -evp des-ede3-cbc
> 
> will pretty reliably produce a panic on our 4- or 8-cpu amd64 and i386
> machines.  The panic is a protection fault when the cv_signal() in
> cryptoret() fires.

A page fault, I guess.

> Further investigation shows that the condvar in
> the active request appears to have been freed, and working around that
> (by use of cv_valid) lets us get far enough to be reasonably sure that
> the containing cryptoreq (the "crp") has been freed but is somehow
> still on cryptoreq's queue.  The only other possibility we see (but
> think we have mostly ruled out) is a pool allocator problem leading to
> overlapping request structures.
> 
> Another very odd thing here is that of course the test above produces a
> single stream of basically synchronous requests.  There should be little
> or no concurrency to create potential problems.
> 
> We've been over the code with a fine-tooth comb and are pretty sure there
> should be no way for the return queue to be manipulated or the code that
> frees requests to be called except under the protection of crypto_mtx.
> 
> It is almost as if crypto_mtx weren't mutexing, specifically around the

It's unlikely that the basic locking primitives are not working. If they
were broken, the system would fail in all kinds of ways. Can you say what
kind of CPU you are using? Does the problem occur if you take all but one
CPU offline using cpuctl?

> TAILQ manipulation for the return queue -- it looks like the TAILQ calls
> in cryptoret() get a stale TAILQ_HEAD that points at freed data.  We tried
> putting membar_sync() before and after the traversal of the TAILQ in
> cryptoret() and crypto_ret_q_remove() but this didn't help; I'm not sure
> it should -- do these operations guarantee that *other* CPUs have all
> pending loads/stores flushed?

membar_sync() doesn't do anything with other CPUs. The call itself is only
defined to imply ordering, but on x86 it will flush everything out to memory
(both loads and stores).
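
To illustrate the ordering point, here is a minimal user-space sketch in
C11. It is not NetBSD kernel code. It assumes atomic_thread_fence() with
memory_order_seq_cst as a rough stand-in for membar_sync(), and the names
payload, ready, producer, consumer and run_demo are made up for the
example. The fence in the producer orders only the producer's own stores.
The consumer sees them in order only because it issues its own fence after
observing the flag.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical names, for illustration only. */
static int payload;        /* plain data published by the producer */
static atomic_int ready;   /* flag the consumer polls */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Orders this thread's earlier store before the flag store below.
     * It does nothing to any other thread or CPU. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    int *out = arg;

    /* Spin until the producer's flag store becomes visible. */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;
    /* Pairing fence on the reading side: orders the flag load before
     * the payload load. Without it the payload read would be a race. */
    atomic_thread_fence(memory_order_seq_cst);
    *out = payload;
    return NULL;
}

/* Runs one producer/consumer round and returns what the consumer read. */
int run_demo(void)
{
    pthread_t p, c;
    int seen = 0;

    payload = 0;
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Calling run_demo() from a small main() and building with -pthread is
enough to try it. The point of the sketch is that a barrier on one side
alone guarantees nothing about what another CPU observes. Both sides have
to cooperate, which a mutex does for you.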

> What is most baffling is that if we run the exact 4.99.66 code in today's
> kernel it breaks terribly in this way whereas in the 4.99.66 kernel it
> works fine.  I can't see a change that should produce this.
> 
> The upshot is that opencrypto is totally broken on MP machines, we've
> spent the best part of a week working on it and are really confused --
> but what we're seeing concerns me because if there is not a bug we've
> missed in opencrypto, much more could be broken.  Help?

I will have a look over the opencrypto code but I'm not all that familiar
with it.

Thanks,
Andrew

References:
- locking/synchronization changes 4.99.66->now? (broken opencrypto)
  - From: Thor Lancelot Simon