tech-crypto archive


Re: GSoc2010 project suggestion: swcryptX



On Tue, Feb 23, 2010 at 12:36:18AM +0100, Hubert Feyrer wrote:
> From my understanding of the code, opencrypto(9) spawns 1 kernel thread
> which then handles the actual crypto requests (crypto.c, crypto_init0()
> and cryptoret()). If a second opencrypto(9) call arrives while the first
> one is being handled, it is queued, and processed later
> (crypto_dispatch()).

You need to carefully follow a request down from userspace, into
cryptosoft, and back up.  It doesn't work the way you seem to think it
does.

One big hint is that the only kernel thread involved in the whole business
is called "cryptoret".

The queues you're looking at are used for result return, not request
dispatch.  Requests are dispatched by invoking the driver's processing
methods via function pointer in crypto_invoke().

In the case of cryptosoft, this ends up running on the same CPU that
originally invoked the opencrypto machinery, unless it's been switched
away from.  Because all the entities involved are marked MPSAFE, this
means as many LWPs as you like can be running in cryptosoft at the same
time.  Your explanation of why this is not so is just wrong.
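
If it helps, here's a toy userland model of that call path -- not the
kernel code, and every name in it is made up -- just to show the shape:
dispatch is a plain function-pointer call that runs in whatever context
made it, and the only queue, and the only thread, sit on the
result-return side:

/*
 * Toy userland model, NOT the actual kernel code; names such as
 * submit, soft_process and return_thread are invented for illustration.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct request {
	unsigned char buf[64];		/* payload to "encrypt" */
	size_t len;
	struct request *next;		/* result-return queue linkage */
};

struct driver {				/* driver method table */
	int (*process)(struct request *);
};

static struct request *retq;		/* result-return queue */
static pthread_mutex_t retq_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t retq_cv = PTHREAD_COND_INITIALIZER;

static void
request_done(struct request *rq)	/* hand the result to the return queue */
{
	pthread_mutex_lock(&retq_mtx);
	rq->next = retq;
	retq = rq;
	pthread_cond_signal(&retq_cv);
	pthread_mutex_unlock(&retq_mtx);
}

static void *
return_thread(void *arg)		/* the lone "cryptoret"-like thread */
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&retq_mtx);
		while (retq == NULL)
			pthread_cond_wait(&retq_cv, &retq_mtx);
		struct request *rq = retq;
		retq = rq->next;
		pthread_mutex_unlock(&retq_mtx);
		printf("completed %zu-byte request\n", rq->len);
		free(rq);
	}
	return NULL;
}

static int
soft_process(struct request *rq)	/* "cryptosoft": work done right here */
{
	for (size_t i = 0; i < rq->len; i++)
		rq->buf[i] ^= 0x5a;	/* stand-in for the real cipher */
	request_done(rq);
	return 0;
}

static struct driver softdrv = { soft_process };

static int
submit(struct driver *drv, struct request *rq)
{
	return drv->process(rq);	/* no queue, no thread: indirect call */
}

int
main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, return_thread, NULL);
	for (int i = 0; i < 4; i++) {	/* any number of callers could do this */
		struct request *rq = calloc(1, sizeof(*rq));
		rq->len = 16 * (i + 1);
		memset(rq->buf, 'x', rq->len);
		submit(&softdrv, rq);	/* runs on *this* CPU, synchronously */
	}
	sleep(1);
	return 0;
}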

Here, look:

I have two CPU cores:
# cpuctl list
Num  HwId Unbound LWPs Interrupts     Last change
---- ---- ------------ -------------- ----------------------------
0    0    online       intr           Wed Jan  6 17:10:01 2010
1    1    online       intr           Wed Jan  6 17:10:01 2010

Here's how fast one core is purely in userspace:
# openssl speed -elapsed -evp des-ede3-cbc
[...]
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
des-ede3-cbc     12629.10k    13063.08k    13136.54k    13128.63k    13191.85k

Here are those two cores, fed via 4 userspace processes:
# openssl speed -elapsed -evp des-ede3-cbc -multi 4
[...]
evp              23044.85k    27947.13k    26671.29k    30792.88k    30500.91k

Here is how fast it runs with one core, using cryptosoft via /dev/crypto:
# sysctl -w kern.cryptodevallowsoft=-1
        [careful to undo this before doing any more software crypto tests...!]
kern.cryptodevallowsoft: 1 -> -1
# openssl speed -elapsed -evp des-ede3-cbc -engine cryptodev
des-ede3-cbc      5176.52k     8921.87k    11116.44k    12024.35k    12258.06k

And here are both (2) cores, using cryptosoft via /dev/crypto:
evp               5637.00k    18884.52k    19146.77k    27567.69k    28165.78k

The huge difference in speed for small requests (12629k vs 5177k at 16
bytes) is the syscall overhead of getting each request into and out of the
kernel.  The small difference for large requests (13192k vs 12258k at 8192
bytes) is because the DES implementation in OpenSSL is better than the one
in cryptosoft -- it has asm, including asm for CBC mode.  But it's clear
that cryptosoft is in fact using both cores.

Of course, this won't let you use multiple cores to offload crypto
processing from IPsec, which I suspect is what you want to do.  But that's
because our networking code is not MP safe: while requests are being
processed in cryptosoft, the rest of the network stack, which invoked
cryptosoft, can't run.

But this has nothing to do with threads nor request submission queues
because there aren't any of either in cryptosoft.

"Fixing" this would mean pretty fundamentally rewriting the cryptosoft
driver to make it queue requests internally, possibly maintain its own
sleepable entities, etc.  And that would probably harm its performance
for the cases where it works well now.  The hardware drivers _have to_
do these things because they have hardware resources to manage; cryptosoft
does not.

Perhaps we could provide an alternate cryptosoft implementation which
queues requests, to speed up IPsec on multi-CPU machines.  Attaching
multiple instances of _that_ might do what you want.
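
To make that concrete, here's a rough sketch of the idea -- purely
hypothetical, userland pthreads standing in for MPSAFE kthreads, all
names invented -- in which submitters just enqueue and return, and a
small pool of workers (one per CPU, say) does the work and fires the
completion callback:

/*
 * Hypothetical "queueing cryptosoft" sketch; nothing here exists today,
 * and swq_dispatch/swq_worker are invented names.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct job {
	void (*done)(struct job *);	/* completion callback */
	unsigned char *buf;
	size_t len;
	struct job *next;
};

static struct job *head, *tail;		/* internal request queue */
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;

static void
swq_dispatch(struct job *j)		/* what the submitter (e.g. IPsec) calls */
{
	j->next = NULL;
	pthread_mutex_lock(&q_mtx);
	if (tail != NULL)
		tail->next = j;
	else
		head = j;
	tail = j;
	pthread_cond_signal(&q_cv);
	pthread_mutex_unlock(&q_mtx);	/* cheap: no crypto in this context */
}

static void *
swq_worker(void *arg)			/* one of these per CPU */
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&q_mtx);
		while (head == NULL)
			pthread_cond_wait(&q_cv, &q_mtx);
		struct job *j = head;
		head = j->next;
		if (head == NULL)
			tail = NULL;
		pthread_mutex_unlock(&q_mtx);
		for (size_t i = 0; i < j->len; i++)
			j->buf[i] ^= 0xa5;	/* stand-in for 3DES etc. */
		j->done(j);
	}
	return NULL;
}

static void
my_done(struct job *j)
{
	printf("finished a %zu-byte job\n", j->len);
	free(j->buf);
	free(j);
}

int
main(void)
{
	pthread_t workers[2];		/* pretend we have 2 cores */
	for (int i = 0; i < 2; i++)
		pthread_create(&workers[i], NULL, swq_worker, NULL);
	for (int i = 0; i < 8; i++) {
		struct job *j = malloc(sizeof(*j));
		j->done = my_done;
		j->len = 1024;
		j->buf = calloc(1, j->len);
		swq_dispatch(j);	/* submitter returns immediately */
	}
	sleep(1);			/* let the workers drain the queue */
	return 0;
}

The obvious cost is the enqueueing and context switching that the current
synchronous cryptosoft gets to skip, which is why it would want to be an
alternate implementation rather than a replacement.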

Another approach would be to look at the FAST_IPSEC code, which already
goes to great pains to be able to wait for requests when opencrypto does
queue them, and see if it could arrange to let other CPUs do packet
processing at those times.  I think that architecturally this is a better
solution, but someone who really understands the networking stack and is
not afraid of the FAST_IPSEC code would have a more useful opinion here
(Jonathan?  Arnaud?  Matt?).

Thor

