Re: sysctl_doeproc() race

To: Kamil Rytarowski <n54%gmx.com@localhost>
Subject: Re: sysctl_doeproc() race
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Tue, 13 Mar 2018 15:17:07 +0700

    Date:        Tue, 13 Mar 2018 03:27:50 +0100
    From:        Kamil Rytarowski <n54%gmx.com@localhost>
    Message-ID:  <24955bcc-ce92-150f-75cc-60bc62287085%gmx.com@localhost>

  | The approach with iterating over pid_table[] didn't work well for me and
  | I've abandoned it quickly.

I am not surprised at that.

  | I've been testing a patch splitting KERN_PROC_PID out into a new
  | branch/function:

  | My impression is that these changes make the code much less readable,

Agreed, though it is usually possible to improve that by refactoring and
similar (but most likely not worth the effort here.)

  | and the performance gain.... was negligible or even negative (!) in a
  | regular running systems with 20-30 processes. I expect that such
  | interface is clearly not a quick-path.

Not a surprise there either - where it might make a difference is when
the number of processes is more in the thousands.   I'm not sure how
20-30 processes counts as a "regular running system" though, my
laptop has > 130 processes at the minute, and it is more or less idling
(replying to this e-mail is its current real task) - other times it wil be
doing a build, perhaps with -j4 or -j8, and be running many more.

  | I was testing the proposal by Christos swapping zombie and allproc lists
  | and I've found that this code is stable for the original problem.

It will certainly arrange that you not find a process twice, at the expense
of making it more likely that you not find a process at all.

I am still not sure that I like that tradeoff - given that we have 
unpredictable behaviour, I still think that always including all
processes observed (even if that means sometimes some are
returned twice)  is the right solution, rather than sometimes
omitting some that could have been returned (procs which
existed before the sysctl etarted and stil exist after it is done.)

In your message yesterday, that I did not reply to (had not
had time to do so yet, and now will not, other than this) you
said ...

  | I expect an interface like sysctl(3) to get a quick snapshot of the current
  | system, 

and I think that's a faulty expectation - to generate a
snapshot it would need to freeze everything (and I
mean everything) while it copies out all the data to
user space.    The only mechanism I have ever seen
which could do that for the proc table was an ancient
"table driver" for 6th edition vintage systems - and it
worked only because:
	there were no multiprocessors
	there was no paging
hence the system call to read the entire contents of
the process (or file, or tty, mount, ...) table would be
guaranteed to not block (to be able to do the read, the
process had to be resident, and running, and if it was
resident, all of it was, so no waiting for pages to be brought
back), and as it was in control of the CPU, and in the
kernel, no premption could happen so that nothing
would change anywhere while it was running.
Of course as soon as the sys call was done, and the
data in user space, it could have become "incorrect"
even in that system, as even then there was no
guarantee a context switch might not happen before
return to user level, so before the reading process got
to examine the data, everything might have altered.
But it was a snapshot of an instant of time in the past.

There is no snapshot now, real, or intended - just a
(slow, or at least, sluggish) scan of a changing data
structure, with everything that implies about the
consistency of the data returned.    What's more, we
do not want more than that - the uses for this (which
is any random process deciding to fetch all processes)
should not be permitted to slow down the system.
The only way to make a snapshot type of result would
be to lock out every possible change for the duration
of the sys call - everything else just freezes.

[Later quotes are again back from the message I am
replying to.]

Note that to the caller there are even more races than is
apparent here, the sysctl can always return processes that
no longer exist by the time the user process gets to examine
the results, and new processes that could have been returned
may now exist - just did not when the list was scanned.
None of that is an issue to be "fixed".

  | I've included in this patch additionally:
  |  - Short-circuit break for KERN_PROC_PID.

This alone would have fixed the original problem with
getting the process returned twice (or the kernel attempting
to make that happen, finding iinsufficient user buffer space
provided, and returning ENOMEM).

  |  - Removal of redundant "if (kbuf)" and "if (marker)" checks.
  |  - Update of comments regarding potential optimization, explaining why
  | we don't want it.
  |  - ESRCH returned for badly written software, that is not prepared to
  | see no result (or see race). This also makes the case of replying a call
  | easier to detect.

All that is fine.

  | I've not observed any ATF regression in local tests

For your ptrace() tests, I would not expect that - while the
sysctl interface allows for truly unrelated processes to perform
proc table scans, and while the processes in your tests are
technically unrelated (to the kernel they seem that way), they're
not so much, as the process doing the scan is more or less in
control of the others of interest, telling them what to do by
sending them messages - so, for example, the situation of
a process id having been reused by another process (the
sysctl returns a process, but it is not the one you were
looking for) should be impossible in those scenarios (assuming
there are no message sending bugs and the code synchronisation
is properly designed).

Are there any other ATF tests that use this sysctl?    Clearly
nothing else should be affected by altering the way it works...

What I'd suggest is adding one more test, where you create
processes much like in the ptrace tests, but without any ptrace
involvement, where the child is simply fork/wait (with random
small delays - of the order of milliseconds - between those steps)
and its child simply does an exit after a small random delay,
with the 1st (direct) sending messages containing the pid
of the current grandchild to the ATF created process which
will be in a loop doing continuous sysctl() looking for that
exact process.

If it ever finds the process after having been told it does not
exist (ESRCH) I think that's a problem (it wil not start looking
for a pid until after it has been created, when a message is received
telling it that, and will stop looking for that pid after a new grandchild
exists, implying that the first no longer exists - so unless the kernel
uses the same pid for 2 successive grandchildren, which I think
we should prevent happening if it is even a remote possibility now)
the only way we should get found/ESRCH/found is if the ESRCH
happened as the sysctl scan occurred just when the process was
moving from allproc to zombproc (so the 2nd "found" is of the
zombie now located on zombproc).

I also suspect that if you test again now, with the sole change to
the current code being that the order of the list scans be
reverted to the way it was (allproc and then zombproc) that
the issues you were having with the ptrace() tests will all be
gone.

kre

Follow-Ups:
- Re: sysctl_doeproc() race
  - From: Robert Elz
- Re: sysctl_doeproc() race
  - From: Kamil Rytarowski

References:
- Re: sysctl_doeproc() race
  - From: Kamil Rytarowski
- Re: sysctl_doeproc() race
  - From: Kamil Rytarowski
- Re: sysctl_doeproc() race
  - From: Martin Husemann
- Re: sysctl_doeproc() race
  - From: Kamil Rytarowski
- sysctl_doeproc() race
  - From: Kamil Rytarowski
- Re: sysctl_doeproc() race
  - From: Christos Zoulas
- Re: sysctl_doeproc() race
  - From: Robert Elz
- Re: sysctl_doeproc() race
  - From: Kamil Rytarowski
- Re: sysctl_doeproc() race
  - From: Robert Elz
- Re: sysctl_doeproc() race
  - From: Robert Elz
- Re: sysctl_doeproc() race
  - From: Kamil Rytarowski

Prev by Date: Re: sysctl_doeproc() race
Next by Date: Re: sysctl_doeproc() race
Previous by Thread: Re: sysctl_doeproc() race
Next by Thread: Re: sysctl_doeproc() race
Indexes:

Home | Main Index | Thread Index | Old Index