Subject: Re: xorg and shared libraries
To: Matthias Scheler <tron@zhadum.de>
From: Greg A. Woods <woods@weird.com>
List: netbsd-users
Date: 01/13/2005 16:34:19
[ On Thursday, January 13, 2005 at 18:32:56 (+0000), Matthias Scheler wrote: ]
> Subject: Re: xorg and shared libraries
>
> While "static" binaries are faster on operations as exec() and fork()
> they are bad for overall system performance simply because they
> require much memory.

That's simply not true at all -- or at least it's a very gross and
extremely misleading slant on what actually happens on most systems, and
what would happen were they to have the majority of their user-land
static-linked.

And fork() is never a concern -- copy-on-write solves that problem.  The
real problem is only with having to also exec ld.so _every_ time for
_every_ process, and then of course its runtime is never very quick
either.
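
To make that per-process cost concrete, here's a rough sketch of my own
(not any existing NetBSD tool; the binary paths in the comments are
invented) that times repeated fork+exec cycles of the same program --
run it against a static and a dynamic build of the same binary and the
runtime-linker overhead shows up directly:

```python
import subprocess
import time

def time_execs(path, n=50):
    """Wall-clock seconds for n fork+exec cycles of the binary at `path`.

    fork() itself is cheap (copy-on-write); for a dynamic binary each
    exec must also run the runtime linker (ld.so / ld.elf_so) before the
    program proper starts, so a static build of the same program should
    come out measurably faster here.
    """
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([path], stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

# Hypothetical paths -- substitute your own static and dynamic builds:
# print(time_execs("/objdir/static/bin/true"))
# print(time_execs("/objdir/dynamic/bin/true"))
```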

As you know I have a great deal of current experience with NetBSD on
this very topic.


> If three processes run different applications use
> the same shared library it will be in system memory once (except for
> pages with library local data which has been modified).

Since on most unix-ish systems the same commands are run quite
frequently (and on some systems almost always), the pages for each
executable appear only once in system memory (shared text), with the
buffer cache keeping them in memory even when all the process instances
for a given command exit, so that upon startup they more or less simply
begin to execute with no delay whatsoever.

The shared-text effect is of course present for both dynamic-linked and
static binaries, but the effect is much stronger with static binaries,
since their shared text includes the library code as well.

Note that with less general use, e.g. on machines that primarily run
compiles, etc., the overhead wasted on ld.so is _enormous_, while the
"waste" due to duplication of library object code between the few
different programs that are run repeatedly is very tiny by comparison.


> If these
> process would instead run three binaries which were statically
> linked with the same library the library would be in memory three
> times.

While this is true, the effects are much, much smaller than you imply.

Since static binaries contain _only_ the library object code that they
explicitly require to run, they do not duplicate whole libraries but
usually only tiny fractions of most libraries (esp. things like libc).

Also, the more shared libraries a program needs, the worse its startup
time becomes, while if it were static-linked (and assuming it uses only
a portion of the object code from each library) its startup time would
be almost instantaneous.
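
A quick way to see how much the runtime linker has to do per exec --
this is a sketch of mine, assuming the usual ldd(1) output format, not
anything from this thread -- is to count the shared objects ld.so must
locate, map, and relocate before main() ever runs; for a static binary
the list is empty:

```python
import subprocess

def needed_libs(path):
    """Shared libraries ldd reports for `path` (empty for a static binary).

    Each entry is one more object the runtime linker must locate, map,
    and relocate at every single exec of the program.
    """
    out = subprocess.run(["ldd", path], capture_output=True, text=True)
    # ldd's resolved-dependency lines look like "libc.so.6 => /lib/... (0x...)"
    return [line.split()[0] for line in out.stdout.splitlines() if "=>" in line]

# e.g. a dynamic /bin/ls typically needs libc plus a few others,
# while a static-linked build of it would report none at all.
```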


> These effects can of course not be measured that easily.

Well, that depends on your perspective.

I've been exclusively using modern static-linked NetBSD systems for over
a year now, and the benefit to the user's experience I see is quite
apparent, never mind what the measurements show.  :-)

However you're spot on with that in one way.  As with all benchmarks,
measurements to show the difference between static and dynamic linking
depend entirely on the real-world workload, and so if the benchmark
doesn't match the workload then it can only mislead.

This effect is exacerbated in this particular issue due to the effects
of such other system features as demand-paging of all exec pages.  For
example a static-linked program which only makes sparse or one-time use
of most of the object code it loads from any given library may end up
with an RSS that's nearly as small as its dynamic-linked twin's.
Meanwhile the overhead of loading those one-time-used pages and then
reclaiming them is lower overall than that of exec()ing ld.so and
runtime linking of the shared library, even if ld.so's text pages, and
all the pages of the library, are already in RAM.

In the end though it's not that hard to measure the overall time to run
something that runs lots of processes, such as "build.sh", on identical
hardware, but once with a dynamic-linked toolchain and once with a
static-linked toolchain.  Indeed someone did this on an m68k system some
time ago and reported their results on one of the lists.  I can't find a
link to the posting, but IIRC the savings of static-linking was on the
order of _HOURS_ of runtime.
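
The measurement itself needs nothing fancier than a harness along these
lines -- a sketch of mine, under the assumption that you have both
toolchains already built and on disk (the paths and build.sh arguments
in the comments are invented placeholders, not a recipe):

```python
import subprocess
import time

def time_workload(cmd, env=None):
    """Wall-clock seconds to run one workload (a command list) to completion."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start

# Hypothetical comparison -- same sources, same machine, two toolchains:
# dyn = time_workload(["./build.sh", "-T", "/tools.dynamic", "release"])
# sta = time_workload(["./build.sh", "-T", "/tools.static", "release"])
# print("static-linked toolchain saves %.1f%%" % (100.0 * (dyn - sta) / dyn))
```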

I've been happy enough with static-linking that I'm loath to even try
dynamic-linking again, even just to do a comparison test.  I'm guessing
though that the 6-1/2 hour build from scratch to ISO I did on a P-III
700MHz machine last night would have taken at least 7-8 hours if the
host OS and toolchain et al had been dynamic-linked.

The savings of static-linking are very real and very noticeable even
without careful scientific measurement.

(of course that machine has 1GB of RAM, but RAM really is CHEAP!!!)


> But I guess running a dynamically and statically linked KDE 3
> on a machine with only 128MB memory will demonstrate the effect.

That would depend entirely on what KDE components were run by the user.

(Can you even run KDE-3 usefully on a 128MB machine at all?  :-)

It's also a rather bad example, since any massive application framework
like KDE contains great gobs of bloated object code shared amongst its
different programs, since many of those programs are often run
simultaneously, and since even the tiniest program requires almost all
of the framework code (e.g. a "hello world" equivalent, static-linked,
would be almost as big as a full text editor, since the amount of unique
object code per program is very tiny compared to the common framework
code that all programs require).

I have found through quite extensive real-world experience now that even
plain old basic X11, once thought to be a huge bloated application
framework all by itself, can be static-linked without major overhead, at
least given my daily usage patterns, and the result is a quite pleasing
improvement in response time for all programs on startup.

For example the desktop I sit behind 90% of the time is a diskless 128MB
SS20.  I run a fully static-linked system on it with 100% static-linked
applications, including all of /usr/X11R6.  It rarely, if ever, pages
to/from swap during normal use.  Now of course I don't run KDE on it,
nor even any other monsters like Mozilla et al. (which would take up
over half the RAM with one single process even though it is dynamic
linked), but I do run my window manager (ctwm) there, and lots of
xterms, and lots of little widgets like xconsole, xload's, xclock's, the
occasional emacs, etc.  I can show you the 'vmstat -s' output after 313
days of uptime if you'd like, but I think these few lines tell the tale:

139861700 total faults taken
 ...
     2742 swap pages in use
     2508 swap allocations
 ... 
     5892 swap ins
     6026 swap outs
        0 pages swapped in
     6861 pages swapped out
  1026242 forks total
 
Remember, that's after nearly a _year_ of uptime and daily operation.
(I've only restarted my X session, i.e. logged out and logged in again,
ten times during that period though, so the total number of forks isn't
as huge as it would have been had I done so every day, but it does run
cron and /etc/daily et al  :-)

Some of the overhead of using dynamic libraries could be eliminated with
a pre-binding cache, such as is done on Darwin and thus Mac OS-X (and
perhaps some other modern systems, especially any other Mach-based ones).

However the only real solution to the unnecessary overhead of having to
exec() two programs for every one process is to implement kernel-based
shared libraries (e.g. as was done oh-so-long-ago on UNIX System V
Release 3).

-- 
						Greg A. Woods

H:+1 416 218-0098  W:+1 416 489-5852 x122  VE3TCP  RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>