tech-kern archive


Re: Changes to make /dev/*random better sooner



On Wed, Apr 09, 2014 at 02:43:23AM +0100, Mindaugas Rasiukevicius wrote:
>
> This is a very positive work for the cases when system is used as a server
> or other workloads which may involve cryptographic activity.  However, you
> seem to assume that aggressive entropy collection is always preferable even
> if it has extra costs.

Can you give an example of this?  How exactly will the changes on this
branch measurably impact performance?  Can you demonstrate with a benchmark?

> Please make this tunable.  Also, please make this tunable *before* merge,
> otherwise this will just never happen.  Like cprng_fast/strong functions
> were never optimised to not use spin-lock (at least they no longer use
> IPL_HIGH, which should have been considered simply as a bug).

As I believe my benchmarks at the time amply demonstrated, the cprng_strong
function is considerably more efficient than the entropy extraction code
it replaced.  Can you please explain how this is an example of my supposed
disdain for performance?

Would you prefer the system to be slower, as it was, when user processes
ask for entropy, rather than faster, as I demonstrably made it?

Unfortunately, the arc4random() implementation which I renamed to cprng_fast
(almost two years ago, one should note) had a severe concurrency bug which
had real security implications.  I flagged the spin lock I used to fix
that code as an emergency measure, and at the time I asked whether you
could propose any other quick, simple fix.  I do not recall that you did.

Another developer claimed at the time to have a lockless, concurrent
replacement for the arc4random() code almost ready to commit.  I won't
call him or her out publicly, but we both know who it is, and if you're
going to pester someone about arc4random()/cprng_fast(), I wish you'd
pick the right target -- not me.

Honestly, I'm sick of getting beaten up for other people's mistakes with
arc4random.  I think it's uncalled for.  I wish you'd stop.  I did not
even touch any cprng functions in the current batch of changes, and you
are still complaining about them.  That doesn't seem right to me.

> Few fragments which caught my eye while skimming through the diff..
> 
> >  #if defined(__HAVE_CPU_COUNTER)
> > -   if (cpu_hascounter())
> > -           return (cpu_counter32());
> > +   if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
> > +           return (cpu_counter());
> > +   }
> >  #endif
> 
> ??

We provide no MI API for obtaining a counter value of any known size except
32 bits, unfortunately.  The instrumentation I added while developing these
changes revealed that the delta entropy estimator was terribly broken due
to wraparound; changing it to 64 bits is the fix.

The code already fell back to microtime() when it couldn't get a counter,
as is still the case on older platforms, so though this may look odd, I
think it is essentially a 64-bit version of precisely what we did before.

It would be nice to be able to get at all timecounters directly -- even
those not currently selected -- in a non-horrible, non-abstraction-violating
way.  That might better address this as well as your concerns about the
clock-skew randomness source below.  I am hopeful that the current
discussion of clocks on tech-kern will lead to generic functionality that
would allow comparing clocks as an entropy source with fewer gyrations.

> > -           rnd_add_uint32(&skewsrc, rnd_counter());
> > -           callout_schedule(&skew_callout, hz);
> > +           rnd_add_uint64(&skewsrc, rnd_counter());
> > +           callout_schedule(&skew_callout, hz / 10);
> 
> Do we really need to run 10 extra callouts per second to get an extra bit?

Are you seriously telling me that you are concerned that adding 9 callouts
per second will measurably impact system performance?

I would submit that my previous work on rnd, to which you now seem to be
retroactively objecting, eliminated the constant use of 1-tick callouts
by the entropy addition code; that change almost certainly more than
makes up for any performance impact of adding 9 callouts per second here.

> I did not investigate, but we have plenty of already existing callouts.
> Is, or can, flipflop logic be used with them?  Or deltas between *all*
> callouts in the system?

Do you really think that this kind of invasive change to the internals
of the callout mechanism will be cheaper?

> > +static void kprintf_rnd_get(size_t bytes, void *priv)
> > +{
> > +   if (mutex_tryenter(&kprintf_mtx)) {
> > +           if (kprnd_added) {
> > +                   SHA512_Final(kprnd_accum, &kprnd_sha);
> > +                   rnd_add_data(&rnd_printf_source,
> > +                                kprnd_accum, sizeof(kprnd_accum), 0);
> > +                   kprnd_added = 0;
> > +                   /* This, we must do, since we called _Final. */
> > +                   SHA512_Init(&kprnd_sha);
> > +                   /* This is optional but seems useful. */
> > +                   SHA512_Update(&kprnd_sha, kprnd_accum,
> > +                                 sizeof(kprnd_accum));
> > +           }
> > +           mutex_exit(&kprintf_mtx);
> > +   }
> > +}
> 
> Pre-checking kprnd_added first rather than locking/unlocking mutex all
> the time would be a start.. but how about not performing SHA-512 while
> holding IPL_HIGH mutex?  Not only embedded systems, but perhaps our VAX
> or m68k users would also prefer to not have micro-freezes every second
> while printing to a console.

Glad to move the check to kprnd_added.  Thanks.

However, can you actually demonstrate that this purported effect is
perceptible to humans?  We're talking about hashing *bytes of output
per minute* in the long run on most systems and perhaps a few hundred
bytes per second in extreme cases.

SHA512 may be slow, but even on ancient, slow CPUs it can hash a few
hundred bytes well within the human threshold of perception of several
milliseconds.

I cannot perceive the effect of this code run in unaccelerated QEMU
on an old slow laptop, FWIW; I tested it in this way just to be sure.

I'm sure it is perceptible if the system is stuck in a tight loop
issuing printfs, but the system is *already* unusable in that case
because it is stuck at IPL_HIGH.  Note that, to ensure debugging is not
impeded, this code doesn't fire if we are printing from DDB.

> > +           kr = rnd_sources.lh_first;
> > +           start = rset->start;
> > +           while (kr != NULL && start > 1) {
> > +                   kr = kr->list.le_next;
> > +                   start--;
> > +           }
> 
> LIST_FIRST(), LIST_NEXT()?

Per what I understand to be our guidelines, I try not to mix this kind
of stylistic cleanup with more substantive changes.  The older parts of
the rnd code still have some very old constructs -- like this one, which
I moved.  I am glad to clean them up in another pass over the code.

Thor

