Re: kern/49017: vfork does not suspend all threads
The following reply was made to PR kern/49017; it has been noted by GNATS.
From: Nico Williams <Nico.Williams%twosigma.com@localhost>
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Fri, 7 Apr 2017 16:35:04 +0000
Robert Elz <kre%munnari.OZ.AU@localhost> wrote:
> | I'll take it much further: it is fork() that is EVIL, and vfork()
> | that is GOOD.
> | Here's my rationale for such an extraordinary statement:
> | https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234
> All that shows is that fork() is (or can be) expensive, which is hardly
> news, nothing at all about evilness, in fact, the closest I can see that
> it comes are these sentences ...
It isn't just expensive. I didn't go into detail about fork-safety
issues, but those are real enough, and also unnecessary.
Fork-safety issues are a necessary result of sharing state from a
starting snapshot of it (which is easy enough to do if you're a small
shell, but quite difficult if you're a large process with many libraries
loaded, many unbeknownst to the original program).
> That's kind of like saying that Ferrari's are evil, because they cost
> too much if all you do is drive them grocery shopping once a week...
To me it's "evil" (I know, hyperbole; a lifeless tool can't really be
evil) because of the fork-safety issues. Admittedly I did not go into
much detail on those. The inherently-slow design does not exactly help.
The Unix community has basically been saying "fork() good, vfork() bad"
for decades. In this we have been sorely mistaken. At the very least,
vfork() is not bad.
> If anything is "evil" from your text (IMO) it would be "But now processes
> tend to be huge" - that is the problem, not fork().
Certainly "slow" is not good. However, layering issues involving APIs
with shortcomings are a fact of life because we are too willing to
Layering issues in Java and similar are something else (legendary?); I
won't go into them. You might object to JVMs in the first place, so
let's not go there.
Layering issues in C can still be quite complex though! Here's one
- main program
-> getXbyY() name service switch
-> LDAP plugin
-> SASL GSS plugin
Here the main program source is simple-looking, but turns out to be
complex at run-time. "But use nscd!" Yes, but nscd itself looks like
Slight API deficiencies in various layers in this example mean that
passing down configuration, or intent to _exit() a fork() parent, and so
on, is basically impossible. One could open-code all of it to avoid
this. One could say "screw TLS, GSS, Kerberos, I'll use IPsec, and
open-code everything", but IPsec is actually the hardest of these
security protocols to use correctly, and anyways, open-coding everything
will take a lot of time and effort.
> But fork() permits
> if (fork() > 0)
Yes! This is true, this is very helpful, and you'll see I make use of
Nothing, mind you, really prevents vfork() from supporting the same,
except that the parent must block :(
Naturally, "if (avfork() > 0) _exit(0);" would be cheaper :)
One can also daemonize ("detach from tty", whatever) by doing vfork()
and then exec(self). That's effectively how one has to do such things
on Windows due to its lack of fork() (though perhaps now with their WSL
thing to support Ubuntu on Windows they now have a fork()??).
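To make the vfork()-then-exec pattern concrete, here is a minimal sketch. The names spawn_via_vfork() and run_and_wait() are made up for illustration and are not from any code mentioned in this thread. The child does nothing between vfork() and execv() except _exit() on failure, which is all a vfork() child may safely do.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start `path` with `argv` using vfork() + execv().
 * Returns the child's pid in the parent, or -1 if vfork() failed.
 */
pid_t spawn_via_vfork(const char *path, char *const argv[])
{
    pid_t pid = vfork();

    if (pid == 0) {
        /* Child: borrows the parent's address space until it execs. */
        execv(path, argv);
        _exit(127);             /* exec failed; never return from here */
    }
    return pid;                 /* parent resumes once the child has exec'd or exited */
}

/* Demo only: spawn, wait, and return the child's exit status (or -1). */
int run_and_wait(const char *path, char *const argv[])
{
    int st;
    pid_t pid = spawn_via_vfork(path, argv);

    if (pid == -1 || waitpid(pid, &st, 0) != pid || !WIFEXITED(st))
        return -1;
    return WEXITSTATUS(st);
}
```

To daemonize this way, a program would presumably pass its own executable path plus some argument of its choosing that tells the new image it is the detached copy, and then _exit() in the parent instead of waiting.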
I've written code that does this, including open source code (e.g., in
> | Briefly: fork()'s copying and/or COW are terrible and would never have been
> | necessary had Dennis Ritchie et. al. thought of vfork()'s semantics.
> While I suspect that fork() is really Ken's, not Dennis's (irrelevant here)
> I kind of doubt that. First because neither of them is/was in any way
> deficient in their thinking (simply ignoring a possibility like that is
Oh, I certainly did not mean to imply that they were!
They are/were luminaries who gave us the best OS of its time, with the
best derivative lineage since. For this I am ever thankful.
That does not mean that they can't have made mistakes (e.g., the lack of
a "create time" for files!), including ones they simply would not have
recognized as mistakes then, but which perhaps later it turns out could
have been designed differently to stand the test of time.
> not something I would expect) and second, because fork(), expensive or not,
> is simply far more general than vfork().
We could have done without fork(). But we could not have done without a
monstrosity like CreateProcess() unless we had either fork() or vfork().
Better then to have fork() than not, but even better to have vfork() to
begin with. vfork() was a bit of brilliance that had to come from
outside New Jersey.
I speculate that the brilliance of fork() in the beginning lay in making
it easy to develop programs like shells by placing the critical process
spawning code in user-land as opposed to kernel-land.
> | Besides, fork() has a ton of safety issues (which I mostly
> | did not address in that gist,
> Nor anywhere else I have seen - I'm sure it is possible to write code
> badly enough that fork() would cause problems, (and it is certainly
> possible to make a mess using buffered I/O) but almost all of that is
> trivially overcome.
Is this PR the right place to do this? (A bit late to ask that, I know.) I
promise to write up a gist about fork-safety some time soon.
The gist of it is this: sharing state based on a one-time snapshot +
shared file descriptors can be devilishly difficult, if not impossible
A classic example is PKCS#11 and cryptography APIs in general. Recall
the complex layering mentioned above: there may not be a way for the
code that calls fork() to re-setup state that cannot be shared.
PKCS#11 explicitly says that the child-side of fork() MUST call
C_Initialize() and lose all its previous state. This follows in part
because the API might internally communicate with a device (e.g., a TPM,
smartcard, other token) via a file descriptor, and it would be difficult
to have two processes communicate with said device over the same file
descriptor in non-atomic ways (the fd not being anything like a
Even if you arrange to re-open the device on the child side, your open
sessions will need to be re-logged-in!
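The shared-descriptor part of this is easy to see in isolation. The sketch below is my own toy illustration, not PKCS#11 code: a temporary file stands in for the device, and the function name is invented. After fork(), parent and child share one open file description, so a read in the child moves the offset the parent sees.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Toy illustration: returns the first byte the parent reads after the
 * child has consumed three bytes from the same descriptor, or -1 on error.
 */
int parent_byte_after_child_read(void)
{
    char path[] = "/tmp/forkfdXXXXXX";
    char buf[3], c;
    int fd, st;
    pid_t pid;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* the file lives only as long as the fd */

    if (write(fd, "abcdef", 6) != 6 || lseek(fd, 0, SEEK_SET) != 0) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0)               /* child: consume "abc" via the shared offset */
        _exit(read(fd, buf, sizeof(buf)) == 3 ? 0 : 1);

    if (waitpid(pid, &st, 0) != pid || !WIFEXITED(st) ||
        WEXITSTATUS(st) != 0 || read(fd, &c, 1) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return c;                   /* the parent never read "abc" itself */
}
```

With a plain file the only casualty is the offset. With a token or a socket carrying a stateful protocol, two processes interleaving requests on one descriptor is where things go wrong.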
Even if you arrange to establish new sessions by reference to old
sessions, some cryptographic primitives fail catastrophically when
So one can use pthread_atfork() (e.g., libpkcs11 in Illumos uses it to
automatically re-initialize on the child-side of fork()) to avoid a lot
of these issues, but again, suppose you want to do
if (fork() > 0)
But how do you indicate intent to continue with pre-fork() state in the
child and not the parent?
If the PKCS#11 / whatever state is buried N>2 layers deep then
indicating intent to exit the parent can be impossible to do.
Now, PKCS#11-using libraries could be made to use pthread_atfork() to
reestablish state on the child side of fork(), but again, some things
can't safely be reused, so intent to exit one or the other side of
fork() is critical.
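For reference, the pthread_atfork() approach looks roughly like the following sketch. The "library" here is a stand-in with a single flag for state; the names toylib_init() and child_sees_reset_state() are invented, and this is not how libpkcs11 is actually written.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int session_open;        /* hypothetical per-process library state */

/* Runs in the child right after fork(): drop state that cannot be shared. */
static void reset_in_child(void)
{
    session_open = 0;
}

/* Hypothetical library init: registers the child handler once. */
int toylib_init(void)
{
    static int registered;

    if (!registered) {
        if (pthread_atfork(NULL, NULL, reset_in_child) != 0)
            return -1;
        registered = 1;
    }
    session_open = 1;
    return 0;
}

/* Demo only: returns 1 if the child saw reset state and the parent kept its own. */
int child_sees_reset_state(void)
{
    int st;
    pid_t pid;

    if (toylib_init() != 0)
        return -1;
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(session_open ? 1 : 0);
    if (waitpid(pid, &st, 0) != pid || !WIFEXITED(st))
        return -1;
    return WEXITSTATUS(st) == 0 && session_open == 1;
}
```

Note that reset_in_child() takes no arguments and has no idea which side of the fork() the caller means to keep, which is exactly the limitation described above.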
We could have a variant of fork() that runs the pthread_atfork() child
handlers in the parent and the parent handlers in the child... but that
would have other weirdness.
So if you want to exit the parent, then the only thing that actually
works is this: fork() early, before complex state is set up.
This brings me to a related issue: daemon() is bad. It's bad because
either the parent exits before the child is ready or complex state must
be set up before daemon() that might not survive the fork(). Oops. A
decade ago in Solaris/Illumos we adopted an alternative design (which I
use in Heimdal now) where two functions are used: one that fork()s and
has the parent wait for the child to signal readiness, and the other
(executed in the child) that signals readiness:
    daemon_prep(); /* Returns here in the child-side of fork(); waits in
                      read(2) on a pipe in the parent */
    /*
     * Tell the parent waiting inside daemon_prep() that the child is ready.
     * The parent will exit.  If we exit, the parent will notice and exit with
     * an error.
     */
This has no fork-safety issues because all the code with fork-safety
issues happens on the child-side of an early fork().
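A minimal sketch of that two-function design follows, assuming a pipe is used for the readiness signal as described. This is not the Solaris/Illumos or Heimdal code; the names daemon_prep_sketch(), daemon_ready_sketch() and launcher_status_demo() are placeholders of mine.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int ready_fd = -1;       /* write end of the readiness pipe (child only) */

/* fork() early; returns only in the child. The parent waits, then exits. */
void daemon_prep_sketch(void)
{
    int fds[2];
    char c;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) == -1)
        _exit(1);
    pid = fork();
    if (pid == -1)
        _exit(1);
    if (pid == 0) {             /* child: keep the write end, carry on */
        close(fds[0]);
        ready_fd = fds[1];
        return;
    }
    close(fds[1]);              /* parent: block until a byte or EOF arrives */
    do {
        n = read(fds[0], &c, 1);
    } while (n == -1 && errno == EINTR);
    _exit(n == 1 ? 0 : 1);      /* EOF means the child died before it was ready */
}

/* Called in the child once all the complex state has been set up. */
void daemon_ready_sketch(void)
{
    char c = 0;

    if (ready_fd == -1)
        return;
    while (write(ready_fd, &c, 1) == -1 && errno == EINTR)
        ;
    close(ready_fd);
    ready_fd = -1;
}

/* Demo only: run the pattern in a scratch process, return the launcher's status. */
int launcher_status_demo(int child_becomes_ready)
{
    int st;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        daemon_prep_sketch();   /* the launcher blocks and exits in here */
        if (child_becomes_ready)
            daemon_ready_sketch();
        _exit(child_becomes_ready ? 0 : 1);
    }
    if (waitpid(pid, &st, 0) != pid || !WIFEXITED(st))
        return -1;
    return WEXITSTATUS(st);
}
```

The parent's exit status is decided by whether it read a byte or hit end-of-file, so a child that dies during startup turns into a non-zero exit in the launcher without any extra bookkeeping.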
This is extremely convenient:
# kdc && echo ready
# kinit -k && echo yes
when you get the shell prompt back that means the service is either
running or failed to start. There is no way you can get the prompt back
and the service subsequently fails to start.
Whereas using daemon() this can happen:
# kdc && echo ready
# kinit || echo no
We adopted this approach in Solaris/Illumos because we replaced the SysV
init and inetd system with a new one (SMF) that understands
inter-service dependencies and does not want to start a service until
its dependencies are running. And that means needing to know precisely
that a service has started, and the way we do that is by having the
service's main program behave as described above.
(SMF also has a process grouping mechanism for representing
multi-process services. This is used to, among other things, detect
crashes of some such processes in order to restart the service.)
One need not like/adopt SMF in order to appreciate/adopt the
> | Now, vfork() is... clumsy because of the stack sharing silliness, but it
> | predates threads, so its authors probably did not realize that taking a
> | callback function and argument to run in a new stack would have been a
> | superior design
> When vfork() was designed, the total (guaranteed) address space was just
> 64KB (text, data, stack, all combined). Duplicating stacks (adding an
> extra stack - and if you want to be able to return in the child, it actually
> means copying the existing stack, while adjusting any self-referencing pointers
> that occur there) would have been laughed away as absurd.
The extra stack can be tiny, since one expects the child to
exec-or-exit. But sure, I understand. OTOH, copying as in fork()
isn't exactly light on resource usage either!
> In another message Nico.Williams%twosigma.com@localhost (kind of) quotes me:
> | Robert Elz <kre%munnari.OZ.AU@localhost> wrote:
> | > [ description of vfork_into_fork() elided ]
> and then says...
> | That's a neat idea, but I don't think it's needed. I can't think of why I
> | would ever need it or any time that I could have used it.
> Maybe you never would have, but I know of one immediate use - that is /bin/sh
> Our sh [...]
Aha, thanks. I get that a fork-me-after-all system call would simplify
that shell. That seems like a valid use case indeed (even if there are
other ways to handle this).
> | What I really want is
> | pid_t avfork(int (*)(void *), void *);
> | which is like vfork() but allocates a new stack, calls the given callback
> | in it just like pthread_create() would, and does not stop any threads in
> | the parent, not even the one that called it.
> I have no objection to that, go ahead, write the code for it, and submit
> it, it sounds useful enough to consider at least.
> | Note that avfork() would have much the same constraints for the child as
> | vfork() does, except, naturally, that the avfork() child could return while
> | the vfork() child cannot.
> Return to what? You're having it execute a callback, are you saying that
> that function can return? Return to where exactly? And what does that
> mean? What would be the difference between
When main() returns, the program exits.
When the callback function in pthread_create() returns, the thread exits.
Ditto with avfork(): when the callback returns, the child process exits.
> child = avfork(func, &sp);
> if ((child = avfork(&sp)) == 0) func();
func() has to run in a separate stack in order to avoid having to stop the
parent thread that called avfork(). Sharing a stack is the reason that
the vfork() parent must stop while the child goes on.
avfork() looks almost exactly like pthread_create() (minus pthread_attr_t).
> If there's none, why the need for the callback? If avfork() cannot
The callback is the function to call on a new stack in the child. Same as with
pthread_create(), only creating a child process that shares the parent's
address space just like vfork().
avfork() is like a combination of pthread_create() and vfork().
> actually return in the child, so the second is not possible, then neither
> can func() right?
The func() is expected to execve() or _exit(), just like vfork()
children. But it can also return since it is a C function! And just
like main(), if it returns, the process (the child in this case) exits.
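To show the intended semantics, here is a rough approximation of avfork(). It is Linux-only and built on clone(2) with CLONE_VM, which is an assumption of mine and not a proposal for how NetBSD would implement it. The names avfork_sketch(), return_status() and avfork_demo(), the fixed stack size, and the extra stack_out parameter (so the caller can free the stack) are all illustrative and not part of the interface quoted above.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define AVFORK_STACK_SIZE (64 * 1024)   /* arbitrary; the child should exec or exit soon */

/*
 * Run fn(arg) in a new process that shares the caller's address space but
 * has its own stack. The caller is not suspended. On success, *stack_out
 * must be freed by the caller once the child has exec'd or exited.
 */
pid_t avfork_sketch(int (*fn)(void *), void *arg, void **stack_out)
{
    char *stack = malloc(AVFORK_STACK_SIZE);
    pid_t pid;

    if (stack == NULL)
        return -1;
    /* Pass the top of the block: stacks grow down on common architectures. */
    pid = clone(fn, stack + AVFORK_STACK_SIZE, CLONE_VM | SIGCHLD, arg);
    if (pid == -1) {
        free(stack);
        return -1;
    }
    *stack_out = stack;
    return pid;
}

/* Example callback: just returns, which ends the child with that status. */
static int return_status(void *arg)
{
    return *(int *)arg;
}

/* Demo only: returns the child's exit status, or -1 on error. */
int avfork_demo(int status)
{
    void *stack;
    int st;
    pid_t pid = avfork_sketch(return_status, &status, &stack);

    if (pid == -1)
        return -1;
    if (waitpid(pid, &st, 0) != pid)
        return -1;
    free(stack);                /* safe now: the child is gone */
    return WIFEXITED(st) ? WEXITSTATUS(st) : -1;
}
```

The same constraints as for a vfork() child apply to the callback, since it runs in the parent's address space. The demo waits before freeing the stack; a real interface would need some way to know when the child has exec'd.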