Re: cprng_fast implementation benchmarks

To: tech-kern%netbsd.org@localhost
Subject: Re: cprng_fast implementation benchmarks
From: Thor Lancelot Simon <tls%panix.com@localhost>
Date: Wed, 23 Apr 2014 09:16:33 -0400

On Wed, Apr 23, 2014 at 10:57:59AM +0200, Joerg Sonnenberger wrote:
> On Tue, Apr 22, 2014 at 11:59:38PM -0400, Thor Lancelot Simon wrote:
> > I believe ChaCha8 is suitable for our purpose: we were previously
> > considering ciphers with, at most, 128-bit security, and even 6-round
> > ChaCha has 139-bit strength against the best currently known attack
> > (at present, there is no attack better than brute force on ChaCha8,
> > and the best attack on ChaCha7 is 2^248).  ChaCha8 appears to be
> > somewhat faster than the old arc4 implementation.
> 
> Sounds wrong. When I measured Salsa20/8, it was ~3 times faster than
> RC4. Code can be found at
> http://www.netbsd.org/~joerg/arc4random_salsa.c.

That's a libc implementation -- and were you calling it for 32 bits at a
time, or bulk data?

In the kernel, called for 32 bits at a time, with the percpu data structures
and the spl calls, ChaCha8 appears to be about 30% faster than arc4.  Called
for 256 bytes at a time, with the additional overhead of copying those bytes
out to userspace, it appears to be about 40% faster.

Supposedly, these ciphers can generate data at somewhere between 8 and 12
cycles per byte even when implemented in C.  Given that, the core cipher
makes a not insignificant contribution to the total cost here, but there
are fixed overheads (the function calls; the percpu allocation and spl
overhead) that account for much of the total time.
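
To make the fixed-overhead point concrete, here is a rough userland sketch
in C.  It is not the code that was benchmarked: the names qr, chacha_block,
sketch_rng, sketch_rng_init and sketch_rng32 are made up for illustration,
and there is no percpu or spl handling in it.  It runs a reduced-round
ChaCha core once per 64-byte block and serves 32-bit words out of a buffer,
so the cipher itself is amortized over 16 calls while any per-call cost is
paid every time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* ChaCha quarter round on four words of the state. */
void qr(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = ROTL32(*d, 16);
    *c += *d; *b ^= *c; *b = ROTL32(*b, 12);
    *a += *b; *d ^= *a; *d = ROTL32(*d, 8);
    *c += *d; *b ^= *c; *b = ROTL32(*b, 7);
}

/* One 16-word output block from `in`; rounds is 8 for ChaCha8. */
void chacha_block(uint32_t out[16], const uint32_t in[16], int rounds)
{
    uint32_t x[16];
    int i;

    assert(rounds > 0 && rounds % 2 == 0);
    memcpy(x, in, sizeof(x));
    for (i = 0; i < rounds; i += 2) {
        /* column round */
        qr(&x[0], &x[4], &x[8],  &x[12]);
        qr(&x[1], &x[5], &x[9],  &x[13]);
        qr(&x[2], &x[6], &x[10], &x[14]);
        qr(&x[3], &x[7], &x[11], &x[15]);
        /* diagonal round */
        qr(&x[0], &x[5], &x[10], &x[15]);
        qr(&x[1], &x[6], &x[11], &x[12]);
        qr(&x[2], &x[7], &x[8],  &x[13]);
        qr(&x[3], &x[4], &x[9],  &x[14]);
    }
    for (i = 0; i < 16; i++)
        out[i] = x[i] + in[i];
}

/* Hypothetical generator state; not the kernel's actual structure. */
struct sketch_rng {
    uint32_t state[16];   /* constants, key, block counter, nonce */
    uint32_t buf[16];     /* most recent output block */
    unsigned have;        /* unread words left in buf */
    unsigned long blocks; /* how many times the cipher core has run */
};

void sketch_rng_init(struct sketch_rng *r, const uint32_t key[8])
{
    memset(r, 0, sizeof(*r));
    r->state[0] = 0x61707865; r->state[1] = 0x3320646e; /* "expand 32-byte k" */
    r->state[2] = 0x79622d32; r->state[3] = 0x6b206574;
    memcpy(&r->state[4], key, 8 * sizeof(uint32_t));
    /* state[12..15] (counter and nonce) start at zero in this sketch */
}

/* Return 32 bits; the cipher core only runs on every 16th call. */
uint32_t sketch_rng32(struct sketch_rng *r)
{
    uint32_t v;

    if (r->have == 0) {
        chacha_block(r->buf, r->state, 8);
        r->state[12]++;   /* low word of the block counter */
        r->have = 16;
        r->blocks++;
    }
    v = r->buf[16 - r->have];
    r->have--;
    return v;
}
```

With a layout like this, 32 calls to sketch_rng32() run the cipher core only
twice, which is one way to see why call, percpu and spl costs can dominate
when output is requested 32 bits at a time.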

Do we still have a compile-time way to check if the kernel (or port) is
uniprocessor only?  If so we should probably #ifdef away the percpu calls
in such kernels, which are probably for slower hardware anyway.
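
Something along these lines is what I have in mind, as a sketch only.  It
assumes the MULTIPROCESSOR kernel option macro is the right compile-time
test, which is exactly the open question above.  The names fast_ctx,
CTX_ACQUIRE, CTX_RELEASE, cur_cpu and sketch_fast32 are hypothetical
stand-ins so that it builds in userland; in a real kernel the
multiprocessor branch would use the percpu(9) accessors, and the counter is
only a placeholder for real generator output.

```c
#include <stdint.h>

/* Hypothetical per-generator context; a counter stands in for cipher state. */
struct fast_ctx {
    uint32_t counter;
};

#ifdef MULTIPROCESSOR
/*
 * Multiprocessor build: each CPU gets its own context.  A real kernel
 * would take and drop a percpu reference here; this array and cur_cpu()
 * are userland placeholders for that.
 */
#define SKETCH_MAXCPUS 4
static struct fast_ctx ctx_percpu[SKETCH_MAXCPUS];
static unsigned cur_cpu(void) { return 0; } /* placeholder: always CPU 0 */
#define CTX_ACQUIRE()   (&ctx_percpu[cur_cpu()])
#define CTX_RELEASE(c)  ((void)(c))
#else
/*
 * Uniprocessor build: one static context, no percpu calls at all.
 */
static struct fast_ctx ctx_single;
#define CTX_ACQUIRE()   (&ctx_single)
#define CTX_RELEASE(c)  ((void)(c))
#endif

/* Return 32 bits of (placeholder) output using whichever path was built. */
uint32_t sketch_fast32(void)
{
    struct fast_ctx *c = CTX_ACQUIRE();
    uint32_t v = c->counter++; /* placeholder, NOT random output */

    CTX_RELEASE(c);
    return v;
}
```

The caller sees the same function either way; only the acquire and release
steps compile down to nothing extra on a uniprocessor kernel.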

Without the data moves to userspace, the 256-byte case would of course be
more indicative of raw cipher performance, but that wasn't the point of
that test; rather, it was meant to determine how well the different
alternatives scale out to additional CPUs.

Thor

