Re: cprng_fast implementation benchmarks

To: Thor Lancelot Simon <tls%panix.com@localhost>
Subject: Re: cprng_fast implementation benchmarks
From: Dennis Ferguson <dennis.c.ferguson%gmail.com@localhost>
Date: Mon, 21 Apr 2014 16:39:52 -0700

On 20 Apr, 2014, at 00:18 , Thor Lancelot Simon <tls%panix.com@localhost> wrote:
> Personally, I'd go with the HC128 variant without the ugly inlining.  We
> restrict bytes-on-key to 512MB -- the best current attack on HC128 requires
> 2^65 bytes to distinguish its keystream, so I think it's a reasonable choice
> at this time.  Percpu ARC4 is really a strawman; RC4 is just not safe at any
> speed any longer.  Let's test SALSA20 and discuss further.

The thing that bothers me about HC-128 is that its output is apparently not
very random.  This paper

    http://viper.eng.iastate.edu/gmdev/pubs/tantra.pdf

shows the number of failures it gets when run through the TestU01 batteries.
While one would need to see the p-values of the failures to know if these are
"kind of bad" or "really bad", the fact that even the Small Crush battery
is finding stuff (that battery takes under 10 seconds to run) is making
this seem, well, not good.  I personally might overlook this if there
were some offsetting advantages to HC-128, but the fact that the mediocre
random numbers it produces are being paid for with 4500 bytes of context is
making me think it can't be a good idea.

When Salsa20/ChaCha was mentioned I turned ChaCha into a PRNG library, available
here

    http://www.netbsd.org/~dennis/ccrand.tgz

if it is useful, and ran the three variations through TestU01 batteries.  
ChaCha8
produced two failures, one in Crush (about half an hour of CPU time to run) and
one in Big Crush (about 3 hours).  ChaCha12 produced a single failure, in Crush.
I fumble-fingered running ChaCha20 and had to redo that, but it has now run
everything except the last half hour of Big Crush with no failures so far.  For
comparison, arc4random() gets a couple of failures in Big Crush, the Mersenne
Twister (which had a good reputation as a fast, high quality non-cryptographic
PRNG) also gets a couple failures and ISAAC, my prior favorite PRNG, runs the
entire set of batteries clean even though it uses less than half the state space
of HC-128 (AES and SHA-1 also run clean, but more slowly).  Given that ChaCha
seems to be keeping pretty good company in terms of random number quality, and
manages to do this securely as well with only 128 bytes of state (i.e. 40 ChaCha
instances consume the same state space as one HC-128 instance), I am thinking
that this seems like a pretty good idea even if it turns out that ChaCha20 isn't
as easy on the CPU as one would prefer.

The reason I think minimizing the amount of state an instance of the PRNG 
requires
is important derives from something clearly demonstrated by your measurements,
that often the cost of computing something from shared data can be hugely 
dominated
not by the computation but rather by the sharing, in terms the cost of taking
locks or in having that data frequently invalidated from your cache or whatever
else you do to ensure mutual exclusion.  I think the approach you are taking,
of maintaining more instances of the data which are shared much less widely,
is exactly right, but I don't think per-CPU plus splhigh() is a recipe that 
works
long term.  It is correct to point out that the random number generator won't be
doing much at splhigh(), but what is incorrect is to assume that the random 
number
generator is the only thing that will need a solution to this problem.  If
per-CPU+splhigh() is the solution to all such problems then aggregate may end up
running at splhigh() a lot, even if each individual application doesn't do much,
and I don't think that is a good outcome.

So I think it might be useful if someone eventually figured out how to
eliminate the splhigh() in favor of even more instances shared even
less widely (but more compatibly), say per-CPU, per-interrupt-level or
even per-application.  If it turns out that having a large number of
random number instances is the best arrangement then the cost of having
a large number of instances becomes relevant.  ChaCha's 128 bytes of state
per instance makes it highly desirable, while HC-128's 4500 bytes per instance
seems a lot to pay for not very random numbers even if it computes them
quickly in benchmarks.

Dennis Ferguson

References:
- cprng_fast implementation benchmarks
  - From: Thor Lancelot Simon

Prev by Date: cprng_fast implementation benchmarks
Next by Date: cprng_fast and interrupts [was Re: Patch: cprng_fast performance - please review.]
Previous by Thread: cprng_fast implementation benchmarks
Next by Thread: Re: cprng_fast implementation benchmarks
Indexes:

Home | Main Index | Thread Index | Old Index