Subject: Re: Issue with large memory systems, and PPC overhead
To: John Klos <john@sixgirls.org>
From: Chuck Silvers <chuq@chuq.com>
List: port-macppc
Date: 11/08/2002 00:54:20
hi,

the fork operation on powerpc isn't so slow.
here's a summary of what I tried:

			ops/sec
		fork+exit	fork+exec+exit
180MHz 604e	1510		490
400MHz G4	8130		1500

300MHz P2	3175		1333
1GHz P3		10200		3745


"fork+exit" just forks and exits.
"fork+exec+exit" fork and execs another program which immediately exits.
all test programs statically linked.


so it's really exec that's relatively slow on ppc.

on the G4 that breaks down to:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 12.04      4.20     4.20   190077 22096.31 31792.03  pmap_remove
  6.48      6.46     2.26   120044 18826.43 18826.43  vcopypage
  6.42      8.70     2.24   160293 13974.41 13974.41  __syncicache
  5.48     10.61     1.91  2011224   949.67  1088.58  pmap_pvo_enter
  5.19     12.42     1.81 11607659   155.93   155.93  splx
  5.16     14.22     1.80 11799073   152.55   152.55  splraise
  3.67     15.50     1.28  2818031   454.22   817.06  pmap_extract
  3.67     16.78     1.28    20402 62738.95 62738.95  Idle
  3.12     17.87     1.09                             __mcount
  2.67     18.80     0.93    40016 23240.70 23603.54  pmap_protect
  2.38     19.63     0.83   600643  1381.85 22553.45  uvm_fault
  1.81     20.26     0.63                             trapagain
  1.78     20.88     0.62  2011224   308.27  2808.60  pmap_enter
  1.46     21.39     0.51  2044254   249.48   249.48  lockmgr
  1.00     21.74     0.35  2112953   165.64   165.64  memcpy

and on the 604:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 15.49      1.78     1.78    13044   136.46   136.46  pmap_copy_page
  9.23      2.84     1.06    19081    55.55    63.34  pmap_remove
  7.05      3.65     0.81    16708    48.48    48.48  __syncicache
  4.70      4.19     0.54   200028     2.70     4.21  pmap_pvo_enter
  3.57      4.60     0.41   618995     0.66     0.66  splx
  3.57      5.01     0.41    58440     7.02    94.17  uvm_fault
  2.44      5.29     0.28   240229     1.17     1.17  pool_get
  2.44      5.57     0.28                             __mcount
  1.74      5.77     0.20                             trapagain
  1.48      5.94     0.17   186161     0.91     0.91  memcpy
  1.39      6.10     0.16   635237     0.25     0.25  splraise
  1.39      6.26     0.16    63546     2.52    83.62  trap
  1.31      6.41     0.15   200026     0.75     9.71  pmap_enter
  1.22      6.55     0.14   143246     0.98     1.11  uvn_findpage
  1.04      6.67     0.12   278160     0.43     0.81  pmap_extract
  1.04      6.79     0.12    16174     7.42    14.06  uvm_pagealloc_strat


we should be able to eliminate the __syncicache() calls (I've been looking
at this a bit already).  I imagine there's some room for improvement
in pmap_remove(), though it will be tricky to figure that out,
since profiling can't see very well into the innards of that
because interrupts are blocked around the interesting part
of that function.

the idle time on the G4 is pretty odd, I don't know whether that's
real or an artifact of profiling.

one other interesting bit is the callee stats for pmap_remove():

G4:
-----------------------------------------------
                4.20    1.84  190077/190077      uvm_unmap_remove [7]
[8]     22.8    4.20    1.84  190077         pmap_remove [8]
                0.94    0.11 4981926/11341676     splx <cycle 1> [26]
                0.76    0.00 4981926/11799073     splraise [27]
                0.00    0.04 1800971/2011216     pmap_pvo_remove [128]
                0.00    0.00 4981926/8990108     pmap_pvo_find_va [409]
                0.00    0.00 4981926/12012737     pmap_interrupts_restore [408]
-----------------------------------------------

604:
-----------------------------------------------
                0.00    0.00       4/19081       pmap_kremove [251]
                1.06    0.15   19077/19081       uvm_unmap_remove [10]
[11]    13.3    1.06    0.15   19081         pmap_remove [11]
                0.00    0.15  177853/200028      pmap_pvo_remove [43]
                0.00    0.00   19081/717628      pmap_interrupts_restore [54]
                0.00    0.00  442741/769101      pmap_pvo_find_va [404]
-----------------------------------------------



so on *average*, each call to pmap_remove() is for a range of 20-some pages,
of which 9 are mapped.  considering that the text and data for the programs
involved here are 4 and 1 pages respectively, there's some other big mapping
going on that we might be able to shrink or eliminate.


here's the profile from a program that just execs itself a bunch of times
and then exits (on the 604):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
  8.31      0.54     0.54    11929 45267.83 45267.83  __syncicache
  5.85      0.92     0.38    15059 25234.08 26889.16  pmap_remove
  4.62      1.22     0.30    12075 24844.72 24844.72  pmap_zero_page
  3.23      1.43     0.21    27243  7708.40 63311.81  uvm_fault
  2.77      1.61     0.18   199642   901.61   901.61  lockmgr
  2.62      1.78     0.17    84788  2005.00  2005.00  memcpy
  2.00      1.91     0.13                             __mcount
  1.85      2.03     0.12    57641  2081.85  3312.25  pmap_pvo_enter
  1.54      2.13     0.10    84785  1179.45  1179.45  pool_get
  1.54      2.23     0.10     3003 33300.03 831291.57  sys_execve
  1.38      2.32     0.09   234878   383.18   383.18  splx
  1.38      2.41     0.09                             trapagain



so it looks like getting rid of the need for all that icache syncing
will be a big improvement.

-Chuck


On Sat, Nov 02, 2002 at 07:14:56PM -0500, John Klos wrote:
> Another thing: fork overhead. I ran the following script on 1,260 numbers
> on several machines, and noticed that there is a large discrepency that is
> not explained by processor performance.
> 
> #!/bin/sh
> total=0
> for number in `cat test`
> do
>   total=`echo $total"+"$number | bc`
>   echo $total
> done
> 
> (No, I'd never use this in real life - awk is much quicker. Also note
> that the times below are user times, as these systems are heavily used)
> 
> Why is it that it seems that forking on PowerPC has so much overhead?
> 68060, 66 MHz:  38.3 seconds,  33 per second,    .5 per second per MHz
> 603e, 200 Mhz:  27.24 seconds, 46.25 per second, .23 per second per MHz
> 604ev, 350 MHz: 14.3 seconds,  88 per second,    .25 per second per MHz
> G3, 300 MHz:    13.5 seconds,  93.3 per second,  .31 per second per MHz
> P3, 550 MHz:    1.93 seconds,  652 per second,   1.19 per second per MHz
> 
> Does anyone care to guess why this is? I suppose part of it could be
> explained by memory bus speed - the 68060 is on a 64 bit memory bus, and
> each PowerPC listed has progressively faster memory (and cache), but I
> don't thing that memory alone could explain a fourfold difference in
> efficiency between PowerPC and i386.
> 
> Thanks,
> John Klos
> Sixgirls Computing Labs