NetBSD-Bugs archive


Re: kern/49017: vfork does not suspend all threads



The following reply was made to PR kern/49017; it has been noted by GNATS.

From: Nico Williams <Nico.Williams%twosigma.com@localhost>
To: <gnats-bugs%netbsd.org@localhost>
Cc: 
Subject: Re: kern/49017: vfork does not suspend all threads
Date: Fri, 7 Apr 2017 16:35:04 +0000

 Robert Elz <kre%munnari.OZ.AU@localhost> wrote:
 >   |  I'll take it much further: it is fork() that is EVIL, and vfork()
 >   |  that is GOOD.
 >   |  
 >   |  Here's my rationale for such an extraordinary statement:
 >   |  https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234
 > 
 > All that shows is that fork() is (or can be) expensive, which is hardly
 > news, nothing at all about evilness, in fact, the closest I can see that
 > it comes are these sentences ...
 
 It isn't just expensive.  I didn't go into detail about fork-safety
 issues, but those are real enough, and also unnecessary.
 
 Fork-safety issues are a necessary result of sharing state from a
 starting snapshot of it (which is easy enough to do if you're a small
 shell, but quite difficult if you're a large process with many libraries
 loaded, many unbeknownst to the original program).
 
 >     [...]
 > 
 > That's kind of like saying that Ferrari's are evil, because they cost
 > too much if all you do is drive them grocery shopping once a week...
 
 To me it's "evil" (I know, hyperbole; a lifeless tool can't really be
 evil) because of the fork-safety issues.  Admittedly I did not go into
 much detail on those.  The inherently-slow design does not exactly help
 either.
 
 The Unix community has basically been saying "fork() good, vfork() bad"
 for decades.  In this we have been sorely mistaken.  At the very least,
 vfork() is not bad.
 
 > If anything is "evil" from your text (IMO) it would be "But now processes
 > tend to be huge" - that is the problem, not fork().
 
 Certainly "slow" is not good.  However, layering issues involving APIs
 with shortcomings are a fact of life because we are too willing to
 re-use code.
 
 Layering issues in Java and similar are something else (legendary?); I
 won't go into them.  You might object to JVMs in the first place, so
 let's not go there.
 
 Layering issues in C can still be quite complex though!  Here's one
 case:
 
  - main program
    -> getXbyY() name service switch
     -> LDAP plugin
      -> OpenSSL
       -> SASL
        -> SASL GSS plugin
         -> GSS-API
          -> Kerberos
           -> OpenSSL
 
 Here the main program source is simple-looking, but turns out to be
 complex at run-time.  "But use nscd!"  Yes, but nscd itself looks like
 this internally.
 
 Slight API deficiencies in various layers in this example mean that
 passing down configuration, or intent to _exit() a fork() parent, and so
 on, is basically impossible.  One could open-code all of it to avoid
 this.  One could say "screw TLS, GSS, Kerberos, I'll use IPsec, and
 open-code everything", but IPsec is actually the hardest of these
 security protocols to use correctly, and anyways, open-coding everything
 will take a lot of time and effort.
 
 > But fork() permits
 > 
 >    if (fork() > 0)
 >        _exit(0);
 
 Yes!  This is true, this is very helpful, and you'll see I make use of
 this myself.
 
 Nothing, mind you, really prevents vfork() from supporting the same,
 except that the parent must block :(
 
 Naturally, "if (avfork() > 0) _exit(0);" would be cheaper :)
 
 One can also daemonize ("detach from tty", whatever) by doing vfork()
 and then exec(self).  That's effectively how one has to do such things
 on Windows due to its lack of fork() (though perhaps with their WSL
 thing to support Ubuntu on Windows they now have a fork()??).
 
 I've written code that does this, including open source code (e.g., in
 Heimdal).
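
A minimal sketch of the vfork()-then-exec pattern described above.  The
helper names spawn_vfork() and wait_exit_status() are hypothetical, not
taken from Heimdal.  To detach, one would pass the program's own path plus
some private "already re-executed" argument (also an assumption here), and
the new image would then call setsid() and so on by itself.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start `path` with `argv` via vfork() + execv().
 * The child borrows the parent's address space and stack, so it does
 * nothing except exec or _exit.  Returns the child's pid, or -1.
 */
pid_t
spawn_vfork(const char *path, char *const argv[])
{
    pid_t pid;

    assert(path != NULL && argv != NULL);
    pid = vfork();
    if (pid == 0) {
        execv(path, argv);
        _exit(127);     /* exec failed; never return from the child */
    }
    return pid;         /* parent resumes once the child has exec'd or exited */
}

/* Hypothetical helper: reap `pid` and return its exit status, or -1. */
int
wait_exit_status(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Nothing is copied or marked copy-on-write here; the cost is that the
calling thread is blocked until the child execs or exits.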
 
 >   |  Briefly: fork()'s copying and/or COW are terrible and would never have been
 >   |  necessary had Dennis Ritchie et. al. thought of vfork()'s semantics.
 > 
 > While I suspect that fork() is really Ken's, not Dennis's (irrelevant here)
 > I kind of doubt that.  First because neither of them is/was in any way
 > deficient in their thinking (simply ignoring a possibility like that is
 
 Oh, I certainly did not mean to imply that they were!
 
 They are/were luminaries who gave us the best OS of its time, with the
 best derivative lineage since.  For this I am ever thankful.
 
 That does not mean that they can't have made mistakes (e.g., the lack of
 a "create time" for files!), including ones they simply would not have
 recognized as mistakes then, but which perhaps later it turns out could
 have been designed differently to stand the test of time.
 
 > not something I would expect) and second, because fork(), expensive or not,
 > is simply far more general than vfork().
 
 We could have done without fork().  But we could not have done without a
 monstrosity like CreateProcess() unless we had either fork() or vfork().
 Better then to have fork() than not, but even better to have vfork() to
 begin with.  vfork() was a bit of brilliance that had to come from
 outside New Jersey.
 
 I speculate that the brilliance of fork() in the beginning lay in making
 it easy to develop programs like shells by placing the critical process
 spawning code in user-land as opposed to kernel-land.
 
 >   |  Besides, fork() has a ton of safety issues (which I mostly
 >   |  did not address in that gist, 
 > 
 > Nor anywhere else I have seen - I'm sure it is possible to write code
 > badly enough that fork() would cause problems, (and it is certainly
 > possible to make a mess using buffered I/O) but almost all of that is
 > trivially overcome.
 
 Is this PR the right place to do this?  (A bit late to ask that, I know.)  I
 promise to write up a gist about fork-safety some time soon.
 
 The gist of it is this: sharing state based on a one-time snapshot +
 shared file descriptors can be devilishly difficult, if not impossible,
 to do.
 
 A classic example is PKCS#11 and cryptography APIs in general.  Recall
 the complex layering mentioned above: there may not be a way for the
 code that calls fork() to re-setup state that cannot be shared.
 
 PKCS#11 explicitly says that the child-side of fork() MUST call
 C_Initialize() and lose all its previous state.  This follows in part
 because the API might internally communicate with a device (e.g., a TPM,
 smartcard, other token) via a file descriptor, and it would be difficult
 to have two processes communicate with said device over the same file
 descriptor in non-atomic ways (the fd not being anything like a
 SOCK_DGRAM fd).
 
 Even if you arrange to re-open the device on the child side, your open
 sessions will need to be re-logged-in!
 
 Even if you arrange to establish new sessions by reference to old
 sessions, some cryptographic primitives fail catastrophically when
 reused incorrectly...
 
 So one can use pthread_atfork() (e.g., libpkcs11 in Illumos uses it to
 automatically re-initialize on the child-side of fork()) to avoid a lot
 of these issues, but again, suppose you want to do
 
     if (fork() > 0)
         _exit(0);
 
 But how do you indicate intent to continue with pre-fork() state in the
 child and not the parent?
 
 If the PKCS#11 / whatever state is buried N>2 layers deep then
 indicating intent to exit the parent can be impossible to do.
 
 Now, PKCS#11-using libraries could be made to use pthread_atfork() to
 reestablish state on the child side of fork(), but again, some things
 can't safely be reused, so intent to exit one or the other side of
 fork() is critical.
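
A minimal sketch of the pthread_atfork() approach, assuming a made-up
library whose "session" is just a flag.  All lib_* names and
child_sees_session() are hypothetical; this is not the Illumos libpkcs11
code.  Note what the handler cannot know: whether the parent is about to
_exit(), in which case dropping the child's state is exactly the wrong
thing to do.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical library state standing in for a logged-in token session. */
static int session_open;

/* Child-side handler: drop inherited state instead of reusing it. */
static void
lib_atfork_child(void)
{
    session_open = 0;
}

/* Hypothetical entry point: open a session; register the handler once. */
int
lib_open_session(void)
{
    static int registered;

    if (!registered) {
        if (pthread_atfork(NULL, NULL, lib_atfork_child) != 0)
            return -1;
        registered = 1;
    }
    session_open = 1;
    return 0;
}

int
lib_session_is_open(void)
{
    return session_open;
}

/*
 * Fork and report what the child sees: 1 if the child still had an open
 * session, 0 if the handler reset it, -1 on error.
 */
int
child_sees_session(void)
{
    int status;
    pid_t pid;

    assert(lib_session_is_open());  /* demo precondition */
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(lib_session_is_open());
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The handler runs unconditionally in every fork() child, which is the
point made above: the layer that calls fork() has no way to pass its
intent down to it.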
 
 We could have a variant of fork() that runs the pthread_atfork() child
 handlers in the parent and the parent handlers in the child... but that
 would have other weirdness.
 
 So if you want to exit the parent, then the only thing that actually
 works is this: fork() early, before complex state is set up.
 
 This brings me to a related issue: daemon() is bad.  It's bad because
 either the parent exits before the child is ready or complex state must
 be set up before daemon() that might not survive the fork().  Oops.  A
 decade ago in Solaris/Illumos we adopted an alternative design (which I
 use in Heimdal now) where two functions are used: one that fork()s and
 has the parent wait for the child to signal readiness, and the other
 (executed in the child) that signals readiness:
 
     daemon_prep(); /* Returns here in the child-side of fork(); waits in
                       read(2) on a pipe in the parent */
     <setup code>
     /*
      * Tell the parent waiting inside daemon_prep() that the child is ready.
      *
      * The parent will exit.  If we exit, the parent will notice and exit with
      * an error.
      */
     daemon_ready();
 
 This has no fork-safety issues because all the code with fork-safety
 issues happens on the child-side of an early fork().
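
A minimal sketch of how the two functions could be built on a pipe.  This
is an illustration under assumptions, not the Solaris/Illumos or Heimdal
code: it shows only the readiness handshake (no setsid(), chdir() or
stdio redirection), and launcher_status() is a hypothetical driver added
just to observe the exit status a shell would see.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int ready_fd = -1;   /* write end of the pipe, held by the child */

/* Fork early.  Returns only in the child; the parent waits in read(2). */
void
daemon_prep(void)
{
    int fds[2];
    char c;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) == -1 || (pid = fork()) == -1)
        _exit(1);
    if (pid == 0) {
        close(fds[0]);
        ready_fd = fds[1];
        return;
    }
    close(fds[1]);
    do {
        n = read(fds[0], &c, 1);
    } while (n == -1 && errno == EINTR);
    /* One byte: child is ready.  EOF: child died before it was ready. */
    _exit(n == 1 ? 0 : 1);
}

/* Tell the parent waiting inside daemon_prep() that setup succeeded. */
void
daemon_ready(void)
{
    char c = 0;

    assert(ready_fd != -1);
    if (write(ready_fd, &c, 1) != 1)
        _exit(1);
    close(ready_fd);
    ready_fd = -1;
}

/*
 * Hypothetical driver: run the handshake in a throwaway process and
 * return the exit status its invoker would see.
 */
int
launcher_status(int setup_fails)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        daemon_prep();      /* only the grandchild gets past this */
        if (setup_fails)
            _exit(1);       /* <setup code> failed before readiness */
        daemon_ready();
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because only the child holds the write end of the pipe, its death before
daemon_ready() shows up in the parent as end-of-file, with no extra
signalling needed.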
 
 This is extremely convenient:
 
 # kdc && echo ready
 ready
 # kinit -k && echo yes
 yes
 # 
 
 When you get the shell prompt back, that means the service is either
 running or failed to start.  There is no way to get the prompt back
 and then have the service subsequently fail to start.
 
 Whereas using daemon() this can happen:
 
 # kdc && echo ready
 ready
 # kinit || echo no
 no
 # 
 
 We adopted this approach in Solaris/Illumos because we replaced the SysV
 init and inetd system with a new one (SMF) that understands
 inter-service dependencies and does not want to start a service until
 its dependencies are running.  And that means needing to know precisely
 that a service has started, and the way we do that is by having the
 service's main program behave as described above.
 
 (SMF also has a process grouping mechanism for representing
 multi-process services.  This is used to, among other things, detect
 crashes of some such processes in order to restart the service.)
 
 One need not like/adopt SMF in order to appreciate/adopt the
 daemon_prep()/daemon_ready() approach.
 
 >   |  Now, vfork() is... clumsy because of the stack sharing silliness, but it
 >   |  predates threads, so its authors probably did not realize that taking a
 >   |  callback function and argument to run in a new stack would have been a
 >   |  superior design
 > 
 > When vfork() was designed, the total (guaranteed) address space was just
 > 64KB (text, data, stack, all combined).   Duplicating stacks (adding an
 > extra stack - and if you want to be able to return in the child, it actually
 > means copying the existing stack, while adjusting any self-referencing pointers
 > that occur there)  would have been laughed away as absurd.
 
 The extra stack can be tiny, since one expects the child to
 exec-or-exit.  But sure, I understand.  OTOH, copying as in fork()
 isn't exactly light on resource usage either!
 
 > In another message Nico.Williams%twosigma.com@localhost (kind of) quotes me:
 >    |  Robert Elz <kre%munnari.OZ.AU@localhost> wrote:
 >    |  > [ description of vfork_into_fork() elided ]
 > 
 > and then says...
 > 
 >    | That's a neat idea, but I don't think it's needed.  I can't think of why I
 >    | would ever need it or any time that I could have used it.
 > 
 > Maybe you never would have, but I know of one immediate use - that is /bin/sh
 > 
 > Our sh [...]
 
 Aha, thanks.  I get that a fork-me-after-all system call would simplify
 that shell.  That seems like a valid use case indeed (even if there are
 other ways to handle this).
 
 >    | What I really want is
 >    |        pid_t avfork(int (*)(void *), void *);
 >    | which is like vfork() but allocates a new stack, calls the given callback
 >    | in it just like pthread_create() would, and does not stop any threads in
 >    | the parent, not even the one that called it.
 > 
 > I have no objection to that, go ahead, write the code for it, and submit
 > it, it sounds useful enough to consider at least.
 > 
 > But...
 > 
 >   | Note that avfork() would have much the same constraints for the child as
 >   | vfork() does, except, naturally, that the avfork() child could return while
 >   | the vfork() child cannot.
 > 
 > Return to what?   You're having it execute a callback, are you saying that
 > that function can return?   Return to where exactly?   And what does that
 > mean?   What would be the difference between
 
 When main() returns, the program exits.
 
 When the callback function in pthread_create() returns, the thread
 exits.
 
 Ditto with avfork(): when the callback returns, the child process exits.
 
 >    child = avfork(func, &sp);
 > and
 >    if ((child = avfork(&sp)) == 0) func();
 > ??
 
 func() has to run in a separate stack in order to avoid having to stop the
 parent thread that called avfork().  Sharing a stack is the reason that
 the vfork() parent must stop while the child goes on.
 
 avfork() looks almost exactly like pthread_create() (minus pthread_attr_t).
 
 > If there's none, why the need for the callback?  If avfork() cannot
 
 The callback is the function to call on a new stack in the child.  Same as with
 pthread_create(), only creating a child process that shares the parent's
 address space just like vfork().
 
 avfork() is like a combination of pthread_create() and vfork().
 
 > actually return in the child, so the second is not possible, then neither
 > can func() right?
 
 The func() is expected to execve() or _exit(), just like vfork()
 children.  But it can also return since it is a C function!  And just
 like main(), if it returns, the process (the child in this case) exits.
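
A rough, Linux-only approximation of these semantics, using clone(2) with
CLONE_VM and a freshly mapped stack.  avfork() itself does not exist; the
names avfork_sketch(), demo_child() and avfork_demo(), the extra stackp
parameter and the stack size are all assumptions made for illustration.
The child also shares the calling thread's thread-local storage (errno
included), so the callback should do very little before execve() or
_exit().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define AVFORK_STACK_SIZE (64 * 1024)   /* arbitrary; child only execs or exits */

/*
 * Run fn(arg) in a new process that shares the caller's address space but
 * has its own stack, so the caller is not suspended.  The stack is handed
 * back through *stackp so the caller can unmap it after reaping the child.
 */
pid_t
avfork_sketch(int (*fn)(void *), void *arg, void **stackp)
{
    void *stack;
    pid_t pid;

    assert(fn != NULL && stackp != NULL);
    stack = mmap(NULL, AVFORK_STACK_SIZE, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;
    /* The stack grows down on the usual Linux targets: pass its top. */
    pid = clone(fn, (char *)stack + AVFORK_STACK_SIZE, CLONE_VM | SIGCHLD, arg);
    if (pid == -1) {
        munmap(stack, AVFORK_STACK_SIZE);
        return -1;
    }
    *stackp = stack;
    return pid;
}

/* Example callback: it simply returns, which ends the child process. */
static int
demo_child(void *arg)
{
    *(volatile int *)arg = 42;  /* visible to the parent: same address space */
    return 7;                   /* becomes the child's exit status */
}

/* Returns the child's exit status, or -1 on error. */
int
avfork_demo(int *shared)
{
    void *stack;
    int status;
    pid_t pid = avfork_sketch(demo_child, shared, &stack);

    if (pid == -1)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;      /* leak the stack rather than unmap it under a live child */
    munmap(stack, AVFORK_STACK_SIZE);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Adding CLONE_VFORK to the flags would give back the vfork() behaviour of
suspending the caller; leaving it out is what lets the parent keep
running.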
 
 Nico
 -- 
 


