Subject: Re: More info on shared library performance problem
To: Vareck Bostrom <bostrov@lava.cs.orst.edu>
From: Gordon Irlam <gordoni@base.com>
List: port-sparc
Date: 11/21/1994 22:45:52
I saw something similar with SunOS once.  It almost drove me mad.
This sounds like it might be the same problem.

If you compiled and ran dhrystone on a single machine it would run
really slowly.  But if you compiled dhrystone on one machine and ran it
on a different machine it would run really fast.  But not always.  Only
sometimes.  Sometimes it always ran slow.  Sometimes it always ran fast.
But where you compiled the executable seemed to have some sort of
effect.

The explanation was that the linker would read libc into memory at one
virtual address to perform the dynamic linking, and later, when the
program was executing, libc would get mmapped in at a different address.

On SPARC machines with a virtually addressed cache, for a physical
page mapped at more than one virtual address to be cacheable, the
page must follow the modulo alignment rule.  This specifies that all
the virtual addresses the page is mapped at must be identical modulo
256k -- I think.  (A smaller value will work on most machines, but is
not specified as part of the architecture.)  This ensures that the
same physical line will not be present at more than one location in
the cache.

When you do an mmap, the OS will try to preserve the modulo alignment
rule if possible, but if it cannot (for instance, if the address to
map at is explicitly specified), then the OS has no alternative but
to mark the pages non-cacheable.

If all of libc was not already in memory when the linker read it,
libc would get loaded into the OS's "buffer cache" at one set of
virtual addresses, and later, when libc was mmapped in, it would get
loaded at another set of virtual addresses that violated the modulo
alignment rule.  This was what sometimes happened if you compiled
and linked dhrystone on the same machine, but not if you used separate
machines.

dhrystone spends a lot of time executing .mul within libc.  If suddenly
all the pages of libc are non-cacheable, then instead of fetching one
instruction per cycle it now takes 20-30 cycles per instruction, and
dhrystone runs really slowly.

SunOS had a really good vm system, but not good enough.  In this case
the problem could have been fixed by violating the modulo alignment
rule for pages that are read only, or making non-cacheable pages
cacheable again as soon as the modulo alignment rule is restored, but
it was probably not worth the effort.

Hope this information might be helpful in solving your problem.
It sounds like it might be closely related.

                                                  Gordon.

[PS: The earlier problems reported with this benchmark sounded like
simple cache conflicts.  These are something dhrystone is prone to cause.
It is especially likely on machines like the SS5 with a small on-chip
cache.  I once got to use a debugger to step through the inner loop of
dhrystone listing the cache lines being accessed and was able to verify
this is something that sometimes happens.]

> We noticed while experimenting with different compiler flags that if one 
> changes the inode of /usr/lib/libc.so.12.0 (by copying it over 
> itself, for example) that the performance is drastically improved. 
> We recorded dhrystone results of 24,000+ with dynamically linked dhrystones,
> however, the performance problem creeps back upon reboot. Also, if one
> runs a large program like Xsun or emacs the problem reappears. This can
> once again be fixed by copying over /usr/lib/libc.so.12.0 with 
> itself. 
> 
> Maybe someone could provide some insight as to what the dynamic 
> linker does when a library changes as noted above, and also what happens
> when a large program is loaded. 
> 
> - bostrov@lava.cs.orst.edu
> - thorpej@cs.orst.edu