Shared resource limit implementation question

To: tech-kern%netbsd.org@localhost
Subject: Shared resource limit implementation question
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Wed, 19 Feb 2014 21:08:59 +0700
The kernel code for handling resource limits attempts to share the
limits structure between processes (wholly reasonable, as limits are
inherited, and rarely changed).  A shared limits struct (which is all
of them when a new one is created) is marked as !pl_writeable.
(Then when a process modifies one of its limits, it is given a copy
of the limits struct, marked pl_writeable that it can modify as needed).

All that is fine.

What confuses me, is this code in kern_fork (really in fork1() if you're
looking for it)...   (the same happens in posix_spawn() in kern_exec.c)

        p1_lim = p1->p_limit;
        if (!p1_lim->pl_writeable) {
                lim_addref(p1_lim);
                p2->p_limit = p1_lim;
        } else {
                p2->p_limit = lim_copy(p1_lim);
        }

p1 & p2 are the two processes of course (p1 is forking, p2 is its new child).

The first half of the "if" is the "normal" case - a shared limits struct just
has its ref count incremented, and is shared even more than it was before.

I don't understand the "else" case though - that is, I know it will work,
(and how) but it doesn't seem like the optimal way to run things.

What that does, is give p2 a new copy of the limits struct (and because
it is new, it is !pl_writeable and hence will be shared when p2 forks in the
future).   That is all OK.

Except, p1's limits struct isn't being shared with p2, and I cannot think
of any reason why it shouldn't be.

The effect of this strategy, it seems to me, is that once a process has
modified its limits struct, it is given a private copy, that is never
shared with anyone - processes that don't modify their limits structs
get to use shared limits structs (for all their descendants, and shared
with any ancestors that don't modify their limits - if the parent process
is a non-modifying process, the limits will also be shared with siblings).

The problem (deficiency, not bug) seems to me, is that the processes both
most likely to alter resources, and to fork, are shells (others do both of
course, but for shells, both are reasonably common).

As kern_fork (and for what it is worth, posix_spawn()) is implemented now,
the effect is that any shell (or other process) that alters its resource limits
causes all of its children to start with a new copy of the limits struct,
and hence, those siblings don't get to share it with each other, even though
none of them (probably) has any interest in touching the struct at all.

This seems to be optimised based upon the assumption that a process that
has modified its limits once, is going to do it again, and again, and
again, and so should keep a private copy of the limits struct, never shared
with anyone.

I don't think that assumption matches reality - more common (IMO) would be
for a process to modify a few of its limits, soon after startup, and then
usually never touch them again (or rarely touch them again).

The comment before the code I quote above says ...

         * Note: p_limit (rlimit stuff) is copy-on-write, so normally  

but what is there isn't really copy on write (that's elsewhere, in the
setrlimit() code - in kern_resource.c) what this is is "copy on has been
written" which is a technique that I don't think has been very well
researched.

The normal way to implement copy-on-write is to make a copy whenever a
write is done, and mark that copy shared (and unwriteable) whenever more
sharing is needed.

That is, I'd suggest that the code above should be more like ...

        p1_lim = p1->p_limit;
        lim_addref(p1_lim);
        p2->p_limit = p1_lim;
        p1_lim->pl_writeable = false;

That is, always share the limits struct between p1 & p2, and being shared,
mark it unwriteable.

Of course, some locking might be needed there, in the case where the
struct was writable before, to make sure some other process/thread isn't
half way through writing to it (I don't think that is possible, as if
pl_writeable is true, then the struct is not shared, it is/was private
to p1, and we know what p1 is doing - it is forking - so it cannot be also
modifying its resources.)   But that thinking is pre-kernel-thread style,
and it is possible (I guess) that one thread of p1 is modifying limits at
the same time as another thread is forking -- I'd hope that isn't a legal
operation sequence, but who knows...

If locking is needed, I'd probably keep the if test, and put the new code
(along with whatever locking is needed) in the else case, replacing the
lim_copy() that is there now, and leaving the !pl_writeable code the same
as it was,

The effect of this change, is that (like now) a process that modifies its
resources gets a new writable copy (if it didn't already have one), which
would then become shared with any future descendents.  If the process (after
forking) modifies its limits again, it simply gets another new (writable)
copy of its limits struct, which remains writable until the next time it
forks (and the whole process repeats).  It would seem to me this would result
in less resource structs being created, and so lower overall overhead.
The worst case would be a process that does...
        for(;;)
                if (fork()) setrlimit(....);
that is, changing limits after every fork - and even that is no worse than
now, it simply means that a new limits struct gets created in setrlimit()
every time, rather than in fork() every time, which is what happens now.

Is there something I am missing here that makes the way it is done now
a better choice?

The one potential issue I can see is the way pl_sv_limit is used - I haven't
analysed that enough to know if the change I'm suggesting would cause any
problems for it - the issue being, that currently, a process can only ever,
in its lifetime, have 2 limits structs - one shareable one it gets when it
is created, and one private one it is given when it first modifies the limits.
With the change I'm suggesting, that would no longer be true, a process could
potentially have a very large number of limits structs (only one active at
a time of course) through its lifetime,  The total number allocated in the
system should be less (or at least, not more) than currently - they'd just
be assigned differently.  So, if the code is changed as I suggest, look at
how pl_sv_limit is used, and potentially fix it - it might then be making
a list of structs, all of which need freeing, rather than never being more
than two (which is what I believe happens now - that is, pl_sv_limit is only
ever set in a writable limits struct, always refers to a shared limits struct,
and writable ones never become unwritable (shareable) again.)

While thinking about all of this, if any of you do, there's one other
issue (more minor potential performance gain) - currently if a process
has a shareable limits struct, and it modifies it, it is always given a
copy (a writeable copy).  (Note: the only difference between writable
and unwritable/shareable copies is the pl_writeable flag value - they
don't live in different kinds of memory or anything.)  It seems like that
should not be needed if pl_refcnt == 1 - that is, if the process that is
about to modify its resources had a private copy (by chance because there
had been no need yet to share it - the process has not forked) then it
could simply switch to being writeable, without being copied, couldn't it?

This optimisation is likely to be of more benefit if the change I suggested
above cannot be made for some reason - as things are now, quite a few
processes start with a shareable limits struct with refcnt==1.  If the
change I suggested is made, that would no longer be true, almost all
processes would start with refcnt==2 (or more), and so dealing with the
case of a process with refcnt==1 would be rarer, and so less important
to optimise.

kre

ps: the dictionary tells me that "writable" is correct, pl_writeable could
possibly be spell-corrected ... (but I would not bother.)
Follow-Ups:
- rethinking resource limits (was Re: Shared resource limit implementation question)
  - From: David Young
Prev by Date: Re: [Milkymist port] virtual memory management
Next by Date: Re: pcb offset into uarea
Previous by Thread: pcb offset into uarea
Next by Thread: rethinking resource limits (was Re: Shared resource limit implementation question)
Indexes:
Home | Main Index | Thread Index | Old Index