Subject: Re: vfork vs. fork (was Re: popen reentrant (was Re: SA/pthread and
To: Matthias Buelow <mkb@mukappabeta.de>
From: Greywolf <greywolf@starwolf.com>
List: tech-kern
Date: 09/14/2003 14:57:04
Thus spake Matthias Buelow ("MB> ") sometime Today...

MB> Copy pages?  But only page tables have to be copied (something which is
MB> avoided with vfork() since there is only one address space for both
MB> processes) -- and shared pages are marked read-only so that
MB> copy-on-write kicks in.

Right, but even COW-forks (roast beef, anyone? :-) are more expen$ive
than vfork().
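Here is the mechanism MB describes, made concrete: copy the page table, share the pages, and copy a page only when somebody writes to it. The sketch below is a hypothetical toy model, not NetBSD's UVM code. `struct page`, `struct space` and the `cow_*` functions are invented for illustration, and the 16-byte "pages" only keep it small. Note that cow_fork() still has to walk the whole page table, which is exactly the work vfork() skips.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy model, not NetBSD's UVM: tiny pages, fixed-size tables. */
#define NPAGES   4
#define PAGESIZE 16

struct page {                 /* a "physical" page, shared by refcount */
    int  refcnt;
    char data[PAGESIZE];
};

struct space {                /* an "address space" is just a page table */
    struct page *pt[NPAGES];
};

/* Give a fresh address space NPAGES private, zero-filled pages. */
static void cow_init(struct space *s)
{
    for (int i = 0; i < NPAGES; i++) {
        s->pt[i] = calloc(1, sizeof(struct page));
        if (s->pt[i] == NULL) { perror("calloc"); exit(1); }
        s->pt[i]->refcnt = 1;
    }
}

/* fork(): copy only the page table; every page becomes shared. */
static void cow_fork(const struct space *parent, struct space *child)
{
    for (int i = 0; i < NPAGES; i++) {
        child->pt[i] = parent->pt[i];
        child->pt[i]->refcnt++;
    }
}

/* Write one byte; if the page is shared, copy it first (the COW fault). */
static void cow_write(struct space *s, int pg, int off, char c)
{
    struct page *p = s->pt[pg];
    assert(p->refcnt >= 1);
    if (p->refcnt > 1) {
        struct page *copy = malloc(sizeof *copy);
        if (copy == NULL) { perror("malloc"); exit(1); }
        memcpy(copy->data, p->data, PAGESIZE);
        copy->refcnt = 1;
        p->refcnt--;          /* the other space keeps the original */
        s->pt[pg] = copy;
        p = copy;
    }
    p->data[off] = c;
}
```

After `cow_init(&a); cow_fork(&a, &b); cow_write(&b, 2, 0, 'x');` only page 2 has been copied. Pages 0, 1 and 3 are still shared, each with a reference count of 2.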

MB> That's not how it works, unless I misunderstand what you mean.

I'm probably wrong on some fronts, but, as I stated earlier, I'm always
learning something new.

MB> >Say you have a 400MB process.  You fork().  Congratulations.  You're now
MB> >up to 800MB (And Constantly Swapping, most likely).  You've just spent time
MB> >COPYING a 400MB footprint into memory.  Do you have that memory free?
MB>
MB> That's not what's happening with copy-on-write.  You won't run out of
MB> memory.  Even with a certain, often-applied optimization, namely
MB> pre-copying a range of pages that will likely be accessed soon,
MB> to cut down the initial page fault spike, it won't be more than perhaps
MB> a couple dozen extra pages (I don't know if NetBSD implements this).

I think we do, actually, but with a process as large as the one I
described, even pretending to allocate that much is pretty significant.
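Some back-of-the-envelope numbers on what a COW fork() has to duplicate up front follow. This is a minimal sketch that assumes i386 without PAE (4 KiB pages, 4-byte page table entries) and counts only the leaf PTEs, ignoring page directories. On those assumptions the 400MB process above needs 102400 PTEs, i.e. 400 KiB of page table, or 1/1024 of the data it maps. `pte_bytes()` is a hypothetical helper.

```c
#include <assert.h>

/* Assumptions: i386 without PAE, 4 KiB pages, 4-byte leaf PTEs. */
#define I386_PAGE_BYTES 4096UL
#define I386_PTE_BYTES  4UL

/* Bytes of leaf page-table entries needed to map `bytes` of memory. */
static unsigned long pte_bytes(unsigned long bytes)
{
    unsigned long pages = (bytes + I386_PAGE_BYTES - 1) / I386_PAGE_BYTES;
    assert(pages <= 1UL << 20);   /* a 4 GiB i386 address space at most */
    return pages * I386_PTE_BYTES;
}
```

For example, `pte_bytes(400UL * 1024 * 1024)` returns 409600.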

MB> >vfork() has been a part of the system API since 3.0BSD.  If you can
MB>
MB> Not without interruption.  It was "removed" in 4.4BSD (in that it
MB> uses ordinary fork and only suspends the parent).  Maybe they wanted to
MB> clean up stuff.  Particularly in System V-derived systems (but not in
MB> all of them), vfork is also only emulated with fork, if it is present
MB> at all.  I don't remember any of the systems that do this being
MB> extraordinarily slow because of it.

I hadn't read that vfork() had been removed, only that its semantics
had changed.
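For what it's worth, the 4.4BSD-style behaviour MB describes (an ordinary fork, with the parent held until the child exec()s or exits) can be approximated in user space with a close-on-exec pipe. This is a sketch of the idea only, not how the 4.4BSD kernel implements it. `spawn_and_wait_for_exec()` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * fork(), then block the parent until the child has exec'd or exited.
 * The pipe's write end is close-on-exec, so the parent's read() sees EOF
 * as soon as the child execs; if execv() fails, the child's _exit()
 * closes the pipe instead.
 */
static pid_t spawn_and_wait_for_exec(const char *path, char *const argv[])
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                  /* child */
        close(fds[0]);
        execv(path, argv);
        _exit(127);                  /* exec failed */
    }

    close(fds[1]);                   /* parent keeps only the read end */
    char c;
    ssize_t n;
    do {
        n = read(fds[0], &c, 1);     /* blocks until exec or exit */
    } while (n == -1 && errno == EINTR);
    close(fds[0]);
    return pid;
}
```

The caller still has to waitpid() for the returned pid. The helper only holds the parent until the exec has happened.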

MB> >3.  ugly, how?  It creates a small space for a specific purpose.
MB> >4.  inelegant?  Hmmm...
MB>
MB> It "poisons" the API by adding a system call which doesn't offer any
MB> additional functionality, only a fast-path optimization for a very
MB> specific situation.

That "very specific situation" also happens to be a "very often encountered
situation".  Builds come to mind; I don't know about anyone else, but
I spend a large share of my time in this "very specific situation".

MB> And it's one for which even someone already proficient with fork
MB> semantics has to go back to the manual and read up, for it
MB> introduces some traps that the naive programmer can easily fall into,
MB> so it also violates the principle of least astonishment.

Not if you understand what vfork() really does: the child shares the
parent's VM space, so the rule is simply "don't do that, then".
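The "don't do that" rules fit in a few lines. Here is a minimal sketch of safe vfork() use, assuming a POSIX-ish libc. The `_DEFAULT_SOURCE` define is there only so glibc declares vfork() under strict `-std=` modes, and `run_with_vfork()` is a hypothetical helper. In the child: exec, and if that fails, `_exit()`. Nothing else.

```c
#define _DEFAULT_SOURCE      /* glibc: expose vfork() under strict -std= modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run `path` with `argv` via vfork() and return its wait status, or -1.
 * Until the child execs or exits it runs in the parent's address space,
 * so it must not modify variables, return from this function, or call
 * exit(); it may only call execv() and, if that fails, _exit().
 */
static int run_with_vfork(const char *path, char *const argv[])
{
    pid_t pid = vfork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        execv(path, argv);   /* the parent stays suspended until this succeeds */
        _exit(127);          /* not exit(): that would run atexit handlers and
                                flush stdio buffers belonging to the parent */
    }
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```

The return value is a raw wait status, to be decoded with WIFEXITED()/WEXITSTATUS(); 127 follows the shell convention for "exec failed".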

It's a given that vfork() suspends the parent until the child exec()s or
_exit()s, while fork() doesn't (unless the parent wait()s for it)...

MB> >5.  It's been part of the API since 3.0BSD.  That's well over 20 years!
MB>
MB> I found these few statements regarding the introduction of vfork
MB> on a plan9 mailing list archive:

"Oh, well, if you're gonna go talk to Plan 9 people, they'll all tell
you that UNIX sucks ass anyway." :-)

MB> The "Right Thing" to do is copy-on-write, which the ordinary fork
MB> does.  This is something that goes on behind the scenes, without
MB> bothering the application programmer.

Well...sort of.  It seems to me that what we need is a RACOW
(Realloc-And-Copy-On-Write): share the space for as long as we can, and
if we touch anything in either space, *then* set up copies of the
affected pages.  This would, of course, be more expen$ive than either
a vfork() or a standard COW fork().
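MB's point that copy-on-write stays out of the programmer's way is easy to check. In the sketch below, the child overwrites a buffer after fork() and the parent's copy is unaffected, yet the program contains no page handling at all. `child_write_is_private()` is a hypothetical helper, and the buffer size in the usage is arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Allocate a buffer, fork, and let the child overwrite it. Returns 1 if the
 * parent's copy is untouched afterwards, 0 if it changed, -1 on error.
 * Nothing here mentions pages or copying; the kernel does that.
 */
static int child_write_is_private(size_t len)
{
    char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'p', len);           /* parent's data */

    pid_t pid = fork();
    if (pid == -1) { free(buf); return -1; }
    if (pid == 0) {
        memset(buf, 'c', len);       /* child writes every page ... */
        _exit(0);                    /* ... which is where copies get made */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) { free(buf); return -1; }

    int intact = 1;
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 'p') { intact = 0; break; }
    free(buf);
    return intact;
}
```

`child_write_is_private(1 << 20)` should return 1 on any system with a conforming fork().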

MB> If it is deemed that fork() is not sufficiently flexible (regarding
MB> the configuration of the forked address space), why not go the whole
MB> way and replace it with something like Plan 9's rfork()?  It accepts
MB> an extra flags argument where you can specify a whole lot of things,
MB> like sharing data and bss segments, or file descriptors, automatically
MB> disassociating child from the parent (eliminates the double-fork
MB> technique used on Unix), or settings affecting other resources more
MB> special to plan9 (like sharing name space).
MB> You could then implement fork(), vfork() as library functions on
MB> top of this.

You know, upon reading this, at first glance (without exploring it
further) it looks like a good idea.  I'm sure I'm going to be told
that I am a gullible idiot in a short while.
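Here is a sketch of how MB's proposal might look from the library side, with fork() and vfork() reduced to preset flag combinations for one rfork()-style primitive. The `SP_*` names and values are hypothetical and deliberately not Plan 9's own `RF*` constants. No system call is made; the code only models which resources each preset shares.

```c
#include <assert.h>

/* Hypothetical flags for a single rfork()-style primitive (invented names). */
enum spawn_flags {
    SP_NEWPROC    = 1 << 0,  /* create a new process at all */
    SP_SHAREMEM   = 1 << 1,  /* share data/bss/heap with the parent */
    SP_COPYFDS    = 1 << 2,  /* child gets a copy of the fd table */
    SP_PARENTWAIT = 1 << 3,  /* parent sleeps until child execs or exits */
    SP_NOWAIT     = 1 << 4   /* detach child: parent need not wait() for it */
};

/* fork() and vfork() as nothing more than preset flag combinations. */
static const unsigned FORK_FLAGS  = SP_NEWPROC | SP_COPYFDS;
static const unsigned VFORK_FLAGS = SP_NEWPROC | SP_COPYFDS | SP_SHAREMEM
                                  | SP_PARENTWAIT;
/* The double-fork daemon trick becomes a single flag. */
static const unsigned DAEMON_FLAGS = SP_NEWPROC | SP_COPYFDS | SP_NOWAIT;

/* Would a write by the child be visible to the parent under these flags? */
static int child_writes_visible(unsigned flags)
{
    return (flags & SP_NEWPROC) && (flags & SP_SHAREMEM);
}
```

`child_writes_visible(VFORK_FLAGS)` is true and `child_writes_visible(FORK_FLAGS)` is false. The only difference between the two presets is the flag bits, which is the appeal of MB's suggestion.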

MB> The original Unix API is very much like something designed on a
MB> blackboard, with somewhat simplified assumptions.  The fact that it has
MB> worked so very well and that it's still in use today pays tribute to
MB> that simple and general design.  That it doesn't fully work out in
MB> practice today means that the core API should be revised, not hacks and
MB> workarounds introduced beside the original API.  It's more important to
MB> keep the "spirit" of the Unix design, not the actual way of
MB> implementation, as if it were chiseled in stone.

Well, unfortunately, the API *is* rather set in stone.  It takes social
engineering by committee to make changes.  Revising the core API
would require a sandblaster at this point, because EVERYONE depends
on the current core API not changing.

MB> Just for the records, I just did some timings fork vs. vfork.  I
MB> [v]fork/execl'd /usr/bin/true (basically /bin/sh with a script) for a
MB> certain number of iterations and found out that vfork() does indeed
MB> have a certain effect (I never expected it NOT to be faster, mind
MB> you):  the difference was about 50% on NetBSD 1.6.1/i386.  On
MB> Solaris/sparc (which also seems to implement a faster vfork), it was
MB> less pronounced -- only about 20%.  What do these results show?  The
MB> speedup isn't crucial for most applications (which do not have an
MB> extremely high fork/exec rate); but in certain situations it might be
MB> beneficial.  Although if you have to wait over an hour for a typical
MB> job to complete, shaving off a minute or two probably isn't that
MB> spectacular.

50 percent?  That's half -- a big win in anyone's book.  That means that
my build of the toolchain is only taking 13 minutes instead of 26.

Even if it drops to 20% in the worst case, that's still significant.
That means that my build of the toolchain is taking 13 minutes instead
of 16:15.

...and that means that my world build is taking an hour instead of 1:15:00.
I can make use of that 15 minutes in much better ways than waiting for a
build to complete, such as installing and reconfiguring the new system.
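MB didn't post his benchmark, but a harness along these lines would repeat it. This is a sketch under assumptions: `time_spawns()` is a hypothetical name, `/bin/sh -c 'exit 0'` stands in for /usr/bin/true, and CLOCK_MONOTONIC does the timing. It measures only process creation plus exec, not a whole build.

```c
#define _DEFAULT_SOURCE          /* glibc: expose vfork() */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Time `iters` spawn+exec+wait cycles; returns seconds, or -1.0 on error. */
static double time_spawns(int use_vfork, int iters,
                          const char *path, char *const argv[])
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        pid_t pid;
        if (use_vfork)
            pid = vfork();       /* child borrows our address space */
        else
            pid = fork();        /* child gets a COW copy */
        if (pid == -1)
            return -1.0;
        if (pid == 0) {
            execv(path, argv);   /* child: exec or _exit, nothing else */
            _exit(127);
        }
        if (waitpid(pid, NULL, 0) == -1)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Call it once with `use_vfork` set to 0 and once with 1, using the same iteration count, and compare the two times. How much the gap matters depends on how large the parent process is.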

MB>  The best idea would imho be to implement something like
MB> plan9's rfork as system call; and use fork/vfork etc. as library
MB> functions.  Then the kernel API is clean again -- there's only one
MB> function to do it, and no additional crufty hack without added
MB> functionality is required for some special optimizations.
MB>
MB> BTW., while perusing the manual pages of various OSes (where I looked
MB> up the Plan9 manpage), I found out <<warning: I understand that this
MB> might be a hot iron for some readers>> that Op*nBSD also implements an
MB> rfork(2) call that looks very much like the Plan9 one.  So it isn't
MB> without precedent, at least.

I have no comment appropriate for this discussion regarding any sort
of bias on other operating systems.  They're out there, they're not going
away any time soon, and some of them appear to have some good ideas.

				--*greywolf;
--
NetBSD:  All your platform are belong to us.