Re: cprng_fast implementation benchmarks

To: tech-kern%netbsd.org@localhost, tech-crypto%netbsd.org@localhost
Subject: Re: cprng_fast implementation benchmarks
From: Thor Lancelot Simon <tls%panix.com@localhost>
Date: Tue, 22 Apr 2014 23:59:38 -0400

On Sun, Apr 20, 2014 at 03:18:03AM -0400, Thor Lancelot Simon wrote:
> I have done some benchmarks of various cprng_fast implementations:
> 
>       arc4-mtx                The libkern implementation from
>                               netbsd-current, which uses a spin mutex to
>                               serialize access to a single, shared arc4
>                               state.
> 
>       arc4-nomtx              Mutex calls #ifdeffed out.  What was in
>                               NetBSD prior to 2012.  This implementation
>                               is not correct.
> 
>       arc4-percpu             New implementation of cprng_fast using percpu
>                               state and arc4 as the core stream cipher.
>                               Uses the arc4 implementation from
>                               sys/crypto/arc4, slightly modified to give an
>                               entry point that skips the xor.
> 
>       hc128-percpu            Same new implementation but with hc128 as the
>                               core stream cipher.  Differs from what I
>                               posted earlier in that all use of inline
>                               functions in the public API has been removed.
> 
>       hc128-inline            Percpu implementation I posted earlier with all
>                               noted bugs fixed; uses inlines in header file
>                               which expose some algorithm guts to speed up
>                               cprng_fast32().

Three more:

        chacha8                 Percpu with Dennis' implementation of ChaCha,
                                8 rounds.
        chacha12                12 rounds
        chacha20                20 rounds

RESULTS
 
 kernel         cpb (32 bit)    4GB (1 way)     16GB (4 ways)   Scaling Factor
 ------         ------------    -----------     -------------   --------------
 arc4-mtx       35              42.58           398.83          0.106
 arc4-nomtx     24              42.12           2338.92         0.018
 arc4-percpu    27              33.63           41.59           0.808
 hc128-percpu   21              23.75           34.90           0.680
 hc128-inline   19              22.66           31.75           0.713
 chacha8        22              20.51           30.45           0.662
 chacha12       24              24.87           34.32           0.724
 chacha20       28              30.45           39.28           0.775
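
For anyone checking the last column: a minimal sketch, assuming the scaling
factor is simply the 1-way time divided by the 4-way time, so that 1.0 would
mean four concurrent generators each finish their 4GB as fast as one does
alone.  The struct and function names are made up for illustration and are
not from the benchmark harness.  Under that assumption the arc4-percpu row
gives 33.63 / 41.59 = 0.808 and the arc4-nomtx row gives 0.018.

```c
/* One row of the results table; times are in the table's own units. */
struct bench_row {
	const char *kernel;	/* kernel name, e.g. "arc4-percpu" */
	double t_1way;		/* "4GB (1 way)" column */
	double t_4way;		/* "16GB (4 ways)" column */
};

/*
 * Assumed definition (not stated in the post): 1-way time over 4-way
 * time.  The 4-way run does four times the total work, so perfect
 * scaling yields 1.0 and heavy contention pushes the value toward 0.
 */
static double
scaling_factor(const struct bench_row *r)
{
	return r->t_1way / r->t_4way;
}
```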

I believe ChaCha8 is suitable for our purpose.  We were previously considering
ciphers with, at most, 128-bit security, and even 6-round ChaCha has 139-bit
strength against the best currently known attack.  At present there is no
attack better than brute force on ChaCha8, and the best known attack on
ChaCha7 has complexity 2^248.  In the table above ChaCha8 is also faster than
every arc4 variant: 22 cpb against 24 to 35, and 20.51 against 33.63 to 42.58
on the 1-way run.

I propose to collapse the relevant bits of Dennis' "ccrnd" into the subr_cprng.c
source file, configured for 8 rounds, and call it a day.
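
To make "configured for 8 rounds" concrete, here is a rough sketch of the
generic ChaCha core with the round count as a parameter.  This is the
publicly specified quarter-round and column/diagonal schedule, not Dennis'
ccrnd code; the function names are hypothetical, and key setup, counter
handling and the percpu state are left out.

```c
#include <stdint.h>

#define ROTL32(v, n)	(((v) << (n)) | ((v) >> (32 - (n))))

/* Standard ChaCha quarter-round on four words of the 16-word state. */
static void
chacha_quarterround(uint32_t x[16], int a, int b, int c, int d)
{
	x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 16);
	x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 12);
	x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 8);
	x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 7);
}

/*
 * Produce one 64-byte block (as 16 words) from the input state.
 * "rounds" is assumed even: 8, 12 or 20 for the variants benchmarked.
 */
static void
chacha_block(uint32_t out[16], const uint32_t in[16], int rounds)
{
	int i;

	for (i = 0; i < 16; i++)
		out[i] = in[i];

	/* Each iteration is one column round plus one diagonal round. */
	for (i = 0; i < rounds; i += 2) {
		chacha_quarterround(out, 0, 4, 8, 12);
		chacha_quarterround(out, 1, 5, 9, 13);
		chacha_quarterround(out, 2, 6, 10, 14);
		chacha_quarterround(out, 3, 7, 11, 15);
		chacha_quarterround(out, 0, 5, 10, 15);
		chacha_quarterround(out, 1, 6, 11, 12);
		chacha_quarterround(out, 2, 7, 8, 13);
		chacha_quarterround(out, 3, 4, 9, 14);
	}

	/* Feed-forward: add the original input state word by word. */
	for (i = 0; i < 16; i++)
		out[i] += in[i];
}
```

Calling chacha_block(out, state, 8) gives the 8-round variant; passing 12 or
20 gives the other two, with cost growing roughly in proportion to the round
count.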

Thor
