Re: Changes to make /dev/*random better sooner



On 7 Apr, 2014, at 21:25, Thor Lancelot Simon <tls%panix.com@localhost> wrote:
> 1) Avoid wraparound problems with delta estimator by making estimation
>   framework 64-bit.
> 
> 2) Adjust rnd_counter to always return a 64-bit value, accordingly.

Hi,

I'd really like to understand what problem is fixed by this.  It
seems to make the code more expensive (significantly so since it
precludes using timestamps in their cheapest-to-obtain form) but
I'm missing the benefit this cost pays for.

I think (but I could be wrong) that the core of the change involves
this

 /*
- * Use the timing of the event to estimate the entropy gathered.
+ * Use the timing/value of the event to estimate the entropy gathered.
  * If all the differentials (first, second, and third) are non-zero, return
  * non-zero.  If any of these are zero, return zero.
  */
-static inline u_int32_t
-rnd_estimate_entropy(krndsource_t *rs, u_int32_t t)
+static inline uint32_t
+rnd_delta_estimate(rnd_delta_t *d, uint64_t v, int64_t delta)
 {
-       int32_t delta, delta2, delta3;
-
-       /*
-        * If the time counter has overflowed, calculate the real difference.
-        * If it has not, it is simplier.
-        */
-       if (t < rs->last_time)
-               delta = UINT_MAX - rs->last_time + t;
-       else
-               delta = rs->last_time - t;
+       int64_t delta2, delta3;
 
-       if (delta < 0)
-               delta = -delta;
+       d->insamples++;
 
        /*
         * Calculate the second and third order differentials
         */
-       delta2 = rs->last_delta - delta;
+       delta2 = d->dx - delta;
        if (delta2 < 0)
                delta2 = -delta2;
 
-       delta3 = rs->last_delta2 - delta2;
+       delta3 = d->d2x - delta2;
        if (delta3 < 0)
                delta3 = -delta3;
 
-       rs->last_time = t;
-       rs->last_delta = delta;
-       rs->last_delta2 = delta2;
+       d->x = v;
+       d->dx = delta;
+       d->d2x = delta2;
 
        /*
         * If any delta is 0, we got no entropy.  If all are non-zero, we
         * might have something.
         */
        if (delta == 0 || delta2 == 0 || delta3 == 0)
                return (0);
 
+       d->outbits++;
        return (1);
 }

but I'm not seeing any problem with wrap around which is being
fixed.  If you have an unsigned 32 bit counter, and sample tc0
from the counter earlier and tc1 later, then the unsigned 32 bit
number of counter ticks which have elapsed between tc0 and tc1
will almost always be

    uint32_t tc0, tc1, delta_tc;
    [...]
    delta_tc = tc1 - tc0;

It needs no more code than that.  The only way that will fail to
be true is if the time between the samples is so long that the
counter wraps 32 bits, but I don't see a problem with that.  The
fastest hardware counter in a machine you can buy today will take
more than a second to wrap 32 bits, so if that happens frequently
you aren't getting much data from this source in any case.
Additionally, each datum you get from a counter like this, which is
sampled asynchronously at a very low rate compared to its period,
will be very entropy-rich, so the very worst thing that can happen
by truncating the difference to 32 bits seems to be that you
might grossly underestimate the entropy of the sample (and, with
a minuscule probability, maybe toss a sample that you should keep
because the wrap produced a delta_tc of 0), but since you are
always estimating the entropy at a constant 1 bit, you are
essentially making a pessimal estimate all the time anyway.  This
change to 64 bit arithmetic will more than double the number of
instructions it takes to compute this on a 32 bit machine, but
all it seems to save here is a minuscule probability of tossing
a timestamp you should keep.  I feel like I've got to be missing
something.
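
To make the wrap case concrete, here's a minimal standalone example
(plain C, nothing kernel-specific) showing the modular subtraction
absorbing a wrap between the two samples:

    #include <inttypes.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint32_t tc0 = 0xfffffff0;      /* earlier sample, near the top */
        uint32_t tc1 = 0x00000010;      /* later sample, after the wrap */

        /* Unsigned subtraction is modulo 2^32, so the wrap is
         * handled for free: 0x10 - 0xfffffff0 == 0x20 ticks. */
        uint32_t delta_tc = tc1 - tc0;

        printf("elapsed ticks: 0x%" PRIx32 "\n", delta_tc);
        return 0;
    }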

That doesn't bother me, though.  I'm more concerned about the
implications of this change for stuff I would like to do, in
particular this consequence:

/*
- * Generate a 32-bit counter.  This should be more machine dependent,
- * using cycle counters and the like when possible.
+ * Generate a 64-bit counter.
  */
-static inline u_int32_t
+static inline uint64_t
 rnd_counter(void)
 {
-       struct timeval tv;
+       struct timespec ts;
+       uint64_t ret;
 
 #if defined(__HAVE_CPU_COUNTER)
-       if (cpu_hascounter())
-               return (cpu_counter32());
+       if (cpu_hascounter() && sizeof(cpu_counter() == sizeof(uint64_t))) {
+               return (cpu_counter());
+       }
 #endif
        if (rnd_ready) {
-               microtime(&tv);
-               return (tv.tv_sec * 1000000 + tv.tv_usec);
+               nanouptime(&ts);
+               ret = ts.tv_sec;
+               ret *= (uint64_t)1000000000;
+               ret += ts.tv_nsec;
+               return ret;
        }
        /* when called from rnd_init, its too early to call microtime safely */
        return (0);
 }

In all cases where the hardware counter isn't 64 bits wide, you
are instead acquiring a full, 64 bit time-of-day timestamp for
a computation which by its own estimate is going to produce 1 bit
of entropy.
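
What I'd like instead is closer to the following sketch (untested,
and rnd_counter_cheap() is an invented name, not a proposal for the
actual function): take the old cpu_counter32() path and stop there,
with no time-of-day conversion at all.

    /*
     * Sketch only: return the cheapest raw time-varying bits
     * available, with no conversion arithmetic.  Hypothetical name.
     */
    static inline uint32_t
    rnd_counter_cheap(void)
    {
    #if defined(__HAVE_CPU_COUNTER)
        if (cpu_hascounter())
            return (cpu_counter32());
    #endif
        /* No usable counter: the caller gets no timing sample. */
        return (0);
    }

Machines with no cycle counter would need some other cheap
time-varying register as a fallback, which I'm ignoring here.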

What I would like to do is to make per-packet timestamps (which you
are doing already) more widely useful by moving where the timestamp is
taken closer to the input interrupt which signals a packet delivery
and carrying it in each packet's mbuf metadata.  This has a nice effect
on the rare occasion that the packet is delivered to a socket with a
consumer with cares about packet arrival times (like an NTP or IEEE 1588
implementation), but it can also provide a new measure of the performance
of the network code when making changes to it (how long did it take packets
to get from the input interface to the socket, or from the input interface
to the output interface?) which doesn't exist now.  In fact it would be
nice to have this right now, so the effect of the changes you are making
could be measured rather than just speculated about.  I was also imagining
that the random number machinery would harvest timestamps from the mbufs,
but maybe only after it has been determined whether a timestamp is one a
user is interested in, so that the random machinery doesn't consume those.
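
Roughly the shape I'm imagining, as a sketch (ph_tickstamp and
tickstamp_to_timespec() are invented names for illustration, not
existing interfaces):

    /* In the driver's receive path: store only the raw counter
     * bits in the packet header; no conversion arithmetic here. */
    static inline void
    pkt_input_stamp(struct mbuf *m)
    {
        m->m_pkthdr.ph_tickstamp = cpu_counter32();  /* hypothetical field */
    }

    /* At socket delivery (or wherever it's decided): convert to a
     * time of day only if some consumer (NTP, IEEE 1588,
     * SO_TIMESTAMP) actually wants one. */
    static inline void
    pkt_input_time(const struct mbuf *m, struct timespec *ts)
    {
        tickstamp_to_timespec(m->m_pkthdr.ph_tickstamp, ts);  /* hypothetical */
    }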

In any case, the acquisition of a time-of-day timestamp in pretty much
all systems is done like this:

(1) Read some time-varying bits out of a hardware register.

(2) Access conversion constants and do arithmetic on the bits to produce
    the time of day.

I have measured this quite a bit and found that, unless you have really
inconvenient hardware, most of the work of getting the time-of-day stamp
is in (2).  It isn't necessarily the arithmetic that is slow; the problem
also seems to be that the shared conversion constants you need are, on a
multiprocessor, often invalidated out of your cache by frequent changes
to the values, so (2) also increases your cache miss rate.  I would
also like to minimize the rate at which these values change, but that rate
will never go to zero.
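
For reference, the usual shape of that code path (paraphrased from
memory of the common timecounter design, not quoted from the NetBSD
source) is something like:

    static void
    get_time_sketch(struct bintime *bt)
    {
        struct timehands *th;
        u_int gen;

        do {
            th = timehands;         /* shared, frequently-updated state */
            gen = th->th_generation;
            *bt = th->th_offset;    /* step (2): conversion arithmetic  */
            bintime_addx(bt, th->th_scale * tc_delta(th));
            /* step (1) hides inside tc_delta(): it reads the hardware
             * register and subtracts th->th_offset_count. */
        } while (gen == 0 || gen != th->th_generation);
    }

Everything here except the register read inside tc_delta() is step (2),
and th points at exactly the shared state whose frequent updates cause
the cache misses.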

So, thinking it would be a profligate waste of resources to do (2) for
every packet's tickstamp even if no user wanted the time-of-day stamp, I
figured out what it would take to do just (1), storing the raw value when
obtaining the tickstamp, and to defer doing (2) until it was discovered
whether anyone wanted to see a time of day or not.  What I didn't imagine,
however, is that random number entropy harvesting could possibly want
more than that raw hardware value since the raw hardware value is all
the entropy there is.  Doing linear arithmetic on it with easily knowable
constants to produce a time of day doesn't change that at all, but costs
something that can be significant, so I couldn't see the point.  More
than this, on a 32 bit machine I didn't think it was worthwhile to ever
store more than 32 bits worth of hardware value since, even if the hardware
knew more than 32 bits, it is usually a little painful to fetch more than
32 of them if all you have are 32 bit loads.

I still believe the time-varying hardware sample represents all the entropy
that time sample can have (in general the value doesn't need to be a
free-running up counter either; it could be a value from an interrupt
down-counter with some associated software state).  Why can't you just
process arbitrary integer values known to vary with the sampling time?
The above looks like way too much work to me, and I don't understand what
the extra work is getting you.
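
Concretely, with the rnd_delta_estimate() from your patch it looks like
one could feed in a raw 32 bit sample directly, computing the first
differential with cheap 32 bit modular arithmetic, something like this
(untested sketch, invented name):

    /*
     * Sketch: estimate entropy from a raw 32 bit time-varying
     * sample, letting unsigned wraparound absorb counter overflow.
     */
    static inline uint32_t
    rnd_estimate_raw32(rnd_delta_t *d, uint32_t sample)
    {
        uint32_t delta = sample - (uint32_t)d->x;   /* mod 2^32 */

        return rnd_delta_estimate(d, sample, (int64_t)delta);
    }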

Dennis Ferguson


