cprng_fast implementation benchmarks
I have done some benchmarks of various cprng_fast implementations:
arc4-mtx      The libkern implementation from netbsd-current, which
              uses a spin mutex to serialize access to a single,
              shared arc4 state.

arc4-nomtx    Mutex calls #ifdeffed out.  What was in NetBSD prior
              to 2012.  This implementation is not correct.

arc4-percpu   New implementation of cprng_fast using percpu state
              and arc4 as the core stream cipher.  Uses the arc4
              implementation from sys/crypto/arc4, slightly modified
              to provide an entry point that skips the xor.

hc128-percpu  The same new implementation, but with hc128 as the
              core stream cipher.  Differs from what I posted
              earlier in that all use of inline functions in the
              public API has been removed.

hc128-inline  The percpu implementation I posted earlier, with all
              noted bugs fixed; it uses inlines in the header file
              which expose some of the algorithm's guts for speed.
All implementations were modified to rekey every 512MB of output and
to use the CPU cycle counter to time the generation of 4MB using
1,000,000 calls to cprng_fast32() after each rekeying.
All implementations were tested on a SuperMicro rackmount server with
an H8QC8+ motherboard, 4 dual-core Opteron 880 CPUs, and 32GB of
DDR400 RAM. System memory bandwidth was measured at 7,732 MB/sec
using the STREAM benchmark.  Memory and cache latencies were not
measured.
All implementations were run in a GENERIC kernel compiled with
"no options DIAGNOSTIC".
The cpb and single-stream tests were run 3 times per kernel and the
median result chosen; the 4-stream test was run once per kernel
due to the very long runtimes for some cases.
The following tests were run:
1) Cycles-per-byte, 32 bit request: at initial keying of
the generator and each automatic rekeying.
2) Generate 4GB of data by requesting 256 bytes at a time
from userspace using a small C program that loops around
the kern.arandom sysctl call. Values are in seconds.
3) Generate 16GB of data by running 4 copies of the generator
program in parallel. Values are in seconds.
kernel        cpb (32 bit)  4GB (1 way)  16GB (4 ways)  Scaling Factor
------        ------------  -----------  -------------  --------------
arc4-mtx                35        42.58         398.83           0.106
arc4-nomtx              24        42.12        2338.92           0.018
arc4-percpu             27        33.63          41.59           0.808
hc128-percpu            21        23.75          34.90           0.680
hc128-inline            19        22.66          31.75           0.713
The immediate surprise here is the performance of the "arc4-nomtx" variant:
it is almost as fast as the percpu HC128 for a single stream of 32 bit requests
but an order of magnitude worse than "arc4-mtx" for 4 streams in parallel.
This is likely because of severe memory and cache contention caused by
multiple CPUs attempting to update the (improperly) "shared" ARC4 state in
parallel.  The mutex-laden variant, "arc4-mtx", performs worse than all
the percpu variants when run in parallel -- but still an order of
magnitude better than the original, mutex-free arc4.
Optimization is hard. Let's go shopping!
The performance of arc4-percpu shows that, for a uniprocessor generating
a single stream of requests, the cost of the per-cpu data references
and bookkeeping is about 15% for short requests, which are the worst
case.  Perhaps an explicitly uniprocessor variant would be justified.
All HC128 variants are faster than all ARC4 variants.
The percpu ARC4 variant scales better than either percpu HC128 variant,
likely due to better cache efficiency of its smaller state array,
but is still slower in absolute terms when run 4-way on this system;
almost 25% slower than the variant of HC128 which uses inlines.
Inlining improves the performance of the percpu HC128 by about 10%, but
this is extremely sensitive to factors which are difficult to control,
such as the quality of compiler cache blocking for the target architecture
and CPU model. In fact, with the code compiled for a generic x86-64 CPU,
defining DIAGNOSTIC, which adds a single not-taken branch prior to every
invocation of the HC128 cipher in the cycles-per-byte test case, *reduces*
cycles-per-byte to 17 or 18. This is an area for further work.
Due to the additional implementation complexity required to buffer their
output, ciphers from the SALSA family were not tested. Details at 11.
Personally, I'd go with the HC128 variant without the ugly inlining. We
restrict bytes-on-key to 512MB -- the best current attack on HC128 requires
2^65 bytes to distinguish its keystream, so I think it's a reasonable choice
at this time. Percpu ARC4 is really a strawman; RC4 is just not safe at any
speed any longer. Let's test SALSA20 and discuss further.
We should run some of these benchmarks on big-endian systems and systems
that are architecturally unlike the OOO x86-64 I used for my tests.