Is there a current known getrusage() problem (amd64)

To: tech-kern%netbsd.org@localhost
Subject: Is there a current known getrusage() problem (amd64)
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Wed, 04 Apr 2018 23:23:11 +0700

While attempting to modify cgdconfig to act like wiz asked,
in order to test my mods, I needed to make some cgd's in
my test system (which usually has such thing).

I started with the standard

	cgdconfig -g -o /tmp/params aes-cbc 192

(more or less straight from the cgdconfig man page).   This
doesn't go near the code I changed (for configuring a cgd,
note: no cgd is mentioned on that cmd line) - but hung.

Upon investigation (digging into this has taken far longer than
the actual cgdconfig mods I was making...) I discovered that
it was looping doing getrusage() sys calls, and nothing else,
with the sys calls becoming further and further apart in time.

The only getrusage() in cgdconfig is in pkcs5_pbkdf2_time()
which contains no loops, but is called from a loop ...

                for (c = 1;; c *= 2) {
                        t = pkcs5_pbkdf2_time(dkLen, c);
                        if (t > CAL_TIME)
                                break;
                }

in pkcs5_pbkdf2_calibrate()

The terminator, 't' (the return value from pkcs5_pbkdf2_time())
is calculated as ...

        return (end.ru_utime.tv_sec - start.ru_utime.tv_sec) * 1000000
             + (end.ru_utime.tv_usec - start.ru_utime.tv_usec);

where the earlier part of the function does:

        (void)getrusage(RUSAGE_SELF, &start);
		/* do work that depends upon ''c' - the 2nd param */
        (void)getrusage(RUSAGE_SELF, &end);

Just from the ktrace output (ktruss actually) I could see the amount
of work increasing, and SIGINFO (^T) was telling me that it was
using lots of CPU time (100% of a cpu).   So clearly the rusage
ru_utime should have been getting bigger and bigger.   Eventuially, one
would think, that should have been more than CAL_TIME (which is
just 30ms - I was observing several -> many seconds of time being
consumed.)

So, I added some printf's to pkcs5_pbkdf2_time to find out what is happening
and saw this to start ...

pkcs5_pbkdf2_time(24, 1): 10 us (0.860 -> 0.870)
pkcs5_pbkdf2_time(24, 2): 5 us (0.874 -> 0.879)
pkcs5_pbkdf2_time(24, 4): 9 us (0.882 -> 0.891)
pkcs5_pbkdf2_time(24, 8): 17 us (0.894 -> 0.911)
pkcs5_pbkdf2_time(24, 16): 34 us (0.913 -> 0.947)
pkcs5_pbkdf2_time(24, 32): 66 us (0.949 -> 0.1015)

First, ignore the printf weirdness (0.860 does not mean
just under one second, but 0 seconds and 860 microseconds,
the n.m format there is just seconds.microseconds, printed,
unpadded ... that's just a diagnostic printf stupidity (mine,
just added) but isn't worth fixing.

The values there are the params to the call (dkLen and c) and
shows c growing (exponentially as expected).   Then there is the
return value, followed by the data from which it is calculated - the
process' user time, in seconds and microseconds, start -> end. 
So in the first line, 860 us had been consumed in the prog before
the call (the start rusage.ru_utime), the "work" was done, after
which 870 microseconds had been consumed, resulting in
10 us for the work.   Good.

Well, I question the point of starting quite so smalll (c==1)
which seems a bit pointless - but that's not the problem.

The following lines show similar data as c gets bigger.
The oscillation of the time with the small c values will just
be tick measurement issues (how long it seems to take
depends upon just when the clock ticks occur).   That's
all fine.

Then the next step after those above:

pkcs5_pbkdf2_time(24, 64): -1017 us (0.1017 -> 0.0)

Somehow, getrusage(RUSAGE_SELF, &end) has resulted in 0
for the user time (which then results in negative consumed user
time - or that's what is calculated.)

I did include tests on the results of the getrusage() calls, just in
case they were failing (and the errors being ignored by the
(void) casts) but no failure was reported.

Note in this it is the 2nd getrusage() call that has bad values,
and that is the last thing done before the printf(), so it is not
a case of the "work" overwriting the rusage data, or anything
like that - the start values (from before the work) are good,
the end ones have vanished.

After that we see ...

pkcs5_pbkdf2_time(24, 128): 0 us (0.0 -> 0.0)
pkcs5_pbkdf2_time(24, 256): 0 us (0.0 -> 0.0)
pkcs5_pbkdf2_time(24, 512): 0 us (0.0 -> 0.0)
pkcs5_pbkdf2_time(24, 1024): 0 us (0.0 -> 0.0)
pkcs5_pbkdf2_time(24, 2048): 0 us (0.0 -> 0.0)
pkcs5_pbkdf2_time(24, 4096): 0 us (0.0 -> 0.0)
pkcs5_pbkdf2_time(24, 8192): 0 us (0.0 -> 0.0)
pkcs5_pbkdf2_time(24, 16384): 0 us (0.0 -> 0.0)

and that just continues forever, with the iteration count (c)
getting higher and higher, and the actual time taken getting
longer and llonger - and the getrusage() user time resuilt
remaining 0.

Does anyone recognise this, or have any suspicions as
to what might be causing it?

I am running this in a XEN DomU amd64 with everything
updated and rebuilt from sources just a few hours old
(more or less, update, build, test again, send this e-mail...)
(I originally saw this with sources from a couple of days
ago, but thought I should update and try again, just in case.)

Alternatively can anyone else running current (particularly
a very up to date current, on amd64 (or others) and as a
XEN DomU) run the cgdconfig command above successfully ?

(running it is harmless, all it does is write cgd config data to
the file named - no actual cgds are needed for this.)

kre

Follow-Ups:
- re: Is there a current known getrusage() problem (amd64)
  - From: matthew green
- Re: Is there a current known getrusage() problem (amd64)
  - From: Robert Elz
- Re: Is there a current known getrusage() problem (amd64)
  - From: Michael van Elst

Prev by Date: Re: Potential new syscall
Next by Date: graphic drivers and simplefb
Previous by Thread: Potential new syscall
Next by Thread: Re: Is there a current known getrusage() problem (amd64)
Indexes:

Home | Main Index | Thread Index | Old Index