Re: cprng_fast implementation benchmarks

To: Thor Lancelot Simon <tls%panix.com@localhost>
Subject: Re: cprng_fast implementation benchmarks
From: Taylor R Campbell <campbell+netbsd-tech-kern%mumble.net@localhost>
Date: Wed, 23 Apr 2014 05:02:33 +0000

   Date: Sun, 20 Apr 2014 03:18:03 -0400
   From: Thor Lancelot Simon <tls%panix.com@localhost>

           The following tests were run:

                   1) Cycles-per-byte, 32 bit request: at initial keying of
                      the generator and each automatic rekeying.

                   2) Generate 4GB of data by requesting 256 bytes at a time
                      from userspace using a small C program that loops around
                      the kern.arandom sysctl call.  Values are in seconds.

                   3) Generate 16GB of data by running 4 copies of the generator
                      program in parallel.  Values are in seconds.

It's good to confirm by these tests that going per-CPU didn't hurt
single-threaded bulk throughput and improved scalability -- so much
that just about any halfway reasonable choice of stream cipher would
do better than what we have now or had before.

What these tests don't measure is the impact on overall system
performance, especially via the CPU cache.  I don't expect going
per-CPU to hurt cachewise, but the difference between RC4's 256-byte
state and HC-128's 4 KB state may be noticeable in, say, build.sh, or
perhaps high network load triggering cprng_fasts in ALTQ or IPsec or
something.

   Let's test SALSA20 and discuss further.

I wrote a little code to implement the cprng_fast API using Salsa20 or
ChaCha with buffer for a single cipher core output.  The code, along
with a little driver to measure single-threaded speed and test vectors
to ensure we are computing Salsa20 and ChaCha, is temporarily at

<https://www.NetBSD.org/~riastradh/tmp/20140421/cprng_fast/>.

The entire state fits in two 64-byte cache lines -- half what RC4
took.  On my Ivy Bridge i7 laptop, I get ~4 cpb throughput in bulk,
average of ~50 cycles for each cprng_fast32, and maximum of ~300
cycles.

(If you need help reproducing my results, let me know.  This is all a
userland mockup, of course, but adapting it to the real kernel
cprng_fast shouldn't be hard.)

This implementation also has the property that the maximum work it
does with interrupts blocked is a single cipher core and copying at
most 64 bytes, plus some additions/subtractions to count things and
maybe schedule a softint.

It services long requests by grabbing a temporary key from the buffer
while interrupts are blocked, and then generating the rest from the
temporary key, which works well because Salsa20 and ChaCha have zero
key setup cost.

   We should run some of these benchmarks on big-endian systems and systems
   that are architecturally unlike the OOO x86-64 I used for my tests.

If I find time I can do some tests on some littleish (by modern
standards) arm and powerpc systems, but I'll have to defer to others
for the VAXen and toasters.

References:
- cprng_fast implementation benchmarks
  - From: Thor Lancelot Simon

Prev by Date: Re: cprng_fast implementation benchmarks
Next by Date: Re: Towards design criteria for cprng_fast()
Previous by Thread: Re: cprng_fast implementation benchmarks
Indexes:

Home | Main Index | Thread Index | Old Index