Subject: vfork vs. fork (was Re: popen reentrant (was Re: SA/pthread and vfork))
To: Greywolf <greywolf@starwolf.com>
From: Matthias Buelow <mkb@mukappabeta.de>
List: tech-kern
Date: 09/14/2003 23:20:44
Greywolf writes:

>[parroting what I've just learned about vfork():]
>If you fork(), even if you don't touch any of the pages before calling
>exec(), you still have to copy all the parent's pages into the child.
>This could potentially run you out of (virtual and physical) memory.

Copy pages?  But only page tables have to be copied (something which is
avoided with vfork() since there is only one address space for both
processes) -- and shared pages are marked read-only so that
copy-on-write kicks in.
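
A minimal sketch of what this looks like from user space, assuming a
POSIX system.  The helper name parent_unaffected_by_child_write() and
the buffer size are made up for illustration; nothing here is specific
to NetBSD.  The child writes to one byte of a buffer inherited from the
parent, and the parent then checks that its own view is unchanged.  The
program cannot see the page-level copying itself, only its effect.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original post).
 * Returns 1 if the parent's buffer is unchanged after the child wrote
 * to its own view of it, 0 if it changed, -1 on error.
 */
int parent_unaffected_by_child_write(size_t nbytes)
{
    if (nbytes == 0)
        return -1;

    char *buf = malloc(nbytes);
    if (buf == NULL)
        return -1;
    memset(buf, 'P', nbytes);        /* touch every page in the parent */

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: on a copy-on-write kernel this first write faults,
         * and only the page being written gets a private copy. */
        buf[0] = 'C';
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int unchanged = (buf[0] == 'P'); /* parent still sees its own data */
    free(buf);
    return child_ok ? unchanged : -1;
}
```

Calling the helper from main() with, say, a 1 MB buffer should return 1:
the child's write lands in a private copy of a single page, and the rest
of the buffer is never duplicated.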

>If you vfork() (if I am understanding correctly), you are basically
>allocating pointers to the parent's pages, with very few localisms to the
>child of the vfork() -- enough to close, open and ioctl some descriptors
>before continuing (not all of that is desirable, necessarily...).

That's not how it works, unless I misunderstand what you mean.

>Say you have a 400MB process.  You fork().  Congratulations.  You're now
>up to 800MB (And Constantly Swapping, most likely).  You've just spent time
>COPYING a 400MB footprint into memory.  Do you have that memory free?

That's not what happens with copy-on-write.  You won't run out of
memory.  Even with a commonly applied optimization, namely pre-copying
a small range of pages that are likely to be accessed soon in order to
cut down the initial page fault spike, it won't be more than perhaps
a couple dozen extra pages (I don't know if NetBSD implements this).

>vfork() has been a part of the system API since 3.0BSD.  If you can

Not without interruption.  It was "removed" in 4.4BSD (in that it
used ordinary fork and merely suspended the parent).  Maybe they wanted
to clean things up.  In System V derived systems in particular (though
not in all of them), vfork is likewise only emulated with fork, if it
is present at all.  I don't recall any of the systems that do this
being extraordinarily slow because of it.

>3.  ugly, how?  It creates a small space for a specific purpose.
>4.  inelegant?  Hmmm...

It "poisons" the API by adding a system call which doesn't offer any
additional functionality, only a fast-path optimization for a very
specific situation.  Even someone already proficient with fork
semantics has to go back to the manual and read up on it, because it
introduces traps that the naive programmer can easily fall into,
so it also violates the principle of least astonishment.

>5.  It's been part of the API since 3.0BSD.  That's well over 20 years!

I found these statements regarding the introduction of vfork
in a Plan 9 mailing list archive:

(from http://bio.cse.psu.edu/~schwartz/9fans/2001-October.txt)

"According to some of the Berkeleyites, there was a flaw in
 the VAX-11/750 memory management unit (microcode?) such
 that they were unable to use copy-on-write.  When I
 mentioned this to the AT&T UNIX System V developers, they
 said it seemed to work fine for them.."

"yeah, i heard that story about the comet.  we had 780's
 and it just worked fine and was faster and cleaner than
 vfork.  i got kre to help me run some benchmarks.  vfork
 seemed to go to a _lot_ of trouble to do very little."

"the bits required aren't there in the hardware page tables, but you
 can either emulate them or do copy on reference.  if you use a paging
 algorithm with local policy instead of the global one used by
 berkeley, you can also prevent some pathetic effects under load, and
 remove complex data structures and code (or not require them in the
 first place).  this was well studied during the late 1960s and early
 1970s.  the BSD approach is poor.  unfortunately, people mimic it.
 i can only assume that one or more textbooks wrote it up."

"Boyd's right on this one: vfork was done for the 780.
 The 750, being a different implementation, had different
 MMU peculiarities, and there were some difficulties getting
 COW to work on it.  I remember John Reiser struggling with
 it for 32V.  I believe he eventually conquered it, but it wasn't
 easy."

>vfork() problems only appear to have resurfaced wrt threading of late.
>It doesn't add any functionality, per se, but it does attempt to optimise
>the use of resources (such as avoiding a pointless copy on something
>that isn't going to be used).

The "Right Thing" to do is copy-on-write, which the ordinary fork
does.  This is something that goes on behind the scenes, without
bothering the application programmer.

>In fact, I'm wondering if there could be some way at some point to
>pass an advisory flag to fork() [I think there should have been in

If it is deemed that fork() is not sufficiently flexible (regarding
the configuration of the forked address space), why not go the whole
way and replace it with something like Plan 9's rfork()?  It accepts
an extra flags argument where you can specify a whole lot of things,
like sharing data and bss segments, or file descriptors, automatically
dissociating the child from the parent (which eliminates the
double-fork technique used on Unix), or settings affecting other
resources more specific to Plan 9 (like sharing the name space).
You could then implement fork() and vfork() as library functions on
top of this.

>the first place, but I can see that the fork() semantics were needed
>fairly immediately, so there was really no room for embellishment on
>the basic functionality at the time, and now we're stuck with that].

The original Unix API looks very much like something designed on a
blackboard, with somewhat simplified assumptions.  The fact that it has
worked so well and is still in use today is a tribute to that simple
and general design.  That it doesn't fully work out in practice today
means that the core API should be revised, rather than hacks and
workarounds being introduced beside the original API.  It's more
important to keep the "spirit" of the Unix design than to treat the
actual implementation as if it were chiseled in stone.


Just for the record, I just did some timings of fork vs. vfork.  I
[v]fork/execl'd /usr/bin/true (basically /bin/sh with a script) for a
certain number of iterations and found that vfork() does indeed
have a measurable effect (I never expected it NOT to be faster, mind
you):  the difference was about 50% on NetBSD 1.6.1/i386.  On
Solaris/sparc (which also seems to implement a faster vfork), it was
less pronounced -- only about 20%.  What do these results show?  The
speedup isn't crucial for most applications (which do not have an
extremely high fork/exec rate), but in certain situations it might be
beneficial.  Then again, if you have to wait over an hour for a typical
job to complete, shaving off a minute or two probably isn't that
spectacular.  The best idea would imho be to implement something like
Plan 9's rfork as a system call, and provide fork/vfork etc. as library
functions.  Then the kernel API is clean again -- there's only one
function to do it, and no additional crufty hack without added
functionality is required for some special optimizations.
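
For anyone who wants to repeat the timing, here is a rough sketch of
such a loop, assuming a POSIX system that still provides vfork().  This
is not the program I used; the function name time_spawns() and its
parameters are invented for illustration.  It spawns the given program
a number of times with either fork() or vfork(), waits for each child,
and returns the elapsed wall-clock time in seconds.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* assumption: needed for vfork() in strict glibc modes */
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` (with no arguments) `iterations`
 * times using fork() or vfork(), waiting for each child.
 * Returns elapsed seconds, or -1.0 on any failure.
 */
double time_spawns(int use_vfork, int iterations, const char *path)
{
    struct timespec t0, t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;

    for (int i = 0; i < iterations; i++) {
        pid_t pid = use_vfork ? vfork() : fork();
        if (pid < 0)
            return -1.0;
        if (pid == 0) {
            /* After vfork() the child may only exec or _exit. */
            execl(path, path, (char *)NULL);
            _exit(127);              /* exec failed */
        }

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1.0;
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_spawns(0, n, "/usr/bin/true") and then
time_spawns(1, n, "/usr/bin/true") from main() and comparing the two
results gives the ratio.  The absolute numbers depend heavily on the
machine, the kernel, and the size of the calling process.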

BTW, while perusing the manual pages of various OSes (where I looked
up the Plan 9 manpage), I found out <<warning: I understand that this
might be a touchy subject for some readers>> that Op*nBSD also
implements an rfork(2) call that looks very much like the Plan 9 one.
So it isn't without precedent, at least.

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}