Subject: Re: Issue with large memory systems, and PPC overhead
To: John Klos <john@sixgirls.org>
From: Chuck Silvers <chuq@chuq.com>
List: port-macppc
Date: 11/08/2002 00:54:20
hi,
the fork operation on powerpc isn't so slow.
here's a summary of what I tried:
ops/sec
fork+exit fork+exec+exit
180MHz 604e 1510 490
400MHz G4 8130 1500
300MHz P2 3175 1333
1GHz P3 10200 3745
"fork+exit" just forks and exits.
"fork+exec+exit" fork and execs another program which immediately exits.
all test programs statically linked.
so it's really exec that's relatively slow on ppc.
on the G4 that breaks down to:
% cumulative self self total
time seconds seconds calls ns/call ns/call name
12.04 4.20 4.20 190077 22096.31 31792.03 pmap_remove
6.48 6.46 2.26 120044 18826.43 18826.43 vcopypage
6.42 8.70 2.24 160293 13974.41 13974.41 __syncicache
5.48 10.61 1.91 2011224 949.67 1088.58 pmap_pvo_enter
5.19 12.42 1.81 11607659 155.93 155.93 splx
5.16 14.22 1.80 11799073 152.55 152.55 splraise
3.67 15.50 1.28 2818031 454.22 817.06 pmap_extract
3.67 16.78 1.28 20402 62738.95 62738.95 Idle
3.12 17.87 1.09 __mcount
2.67 18.80 0.93 40016 23240.70 23603.54 pmap_protect
2.38 19.63 0.83 600643 1381.85 22553.45 uvm_fault
1.81 20.26 0.63 trapagain
1.78 20.88 0.62 2011224 308.27 2808.60 pmap_enter
1.46 21.39 0.51 2044254 249.48 249.48 lockmgr
1.00 21.74 0.35 2112953 165.64 165.64 memcpy
and on the 604:
% cumulative self self total
time seconds seconds calls us/call us/call name
15.49 1.78 1.78 13044 136.46 136.46 pmap_copy_page
9.23 2.84 1.06 19081 55.55 63.34 pmap_remove
7.05 3.65 0.81 16708 48.48 48.48 __syncicache
4.70 4.19 0.54 200028 2.70 4.21 pmap_pvo_enter
3.57 4.60 0.41 618995 0.66 0.66 splx
3.57 5.01 0.41 58440 7.02 94.17 uvm_fault
2.44 5.29 0.28 240229 1.17 1.17 pool_get
2.44 5.57 0.28 __mcount
1.74 5.77 0.20 trapagain
1.48 5.94 0.17 186161 0.91 0.91 memcpy
1.39 6.10 0.16 635237 0.25 0.25 splraise
1.39 6.26 0.16 63546 2.52 83.62 trap
1.31 6.41 0.15 200026 0.75 9.71 pmap_enter
1.22 6.55 0.14 143246 0.98 1.11 uvn_findpage
1.04 6.67 0.12 278160 0.43 0.81 pmap_extract
1.04 6.79 0.12 16174 7.42 14.06 uvm_pagealloc_strat
we should be able to eliminate the __syncicache() calls (I've been looking
at this a bit already). I imagine there's some room for improvement
in pmap_remove(), though it will be tricky to figure that out,
since profiling can't see very well into the innards of that
because interrupts are blocked around the interesting part
of that function.
the idle time on the G4 is pretty odd, I don't know whether that's
real or an artifact of profiling.
one other interesting bit is the callee stats for pmap_remove():
G4:
-----------------------------------------------
4.20 1.84 190077/190077 uvm_unmap_remove [7]
[8] 22.8 4.20 1.84 190077 pmap_remove [8]
0.94 0.11 4981926/11341676 splx <cycle 1> [26]
0.76 0.00 4981926/11799073 splraise [27]
0.00 0.04 1800971/2011216 pmap_pvo_remove [128]
0.00 0.00 4981926/8990108 pmap_pvo_find_va [409]
0.00 0.00 4981926/12012737 pmap_interrupts_restore [408]
-----------------------------------------------
604:
-----------------------------------------------
0.00 0.00 4/19081 pmap_kremove [251]
1.06 0.15 19077/19081 uvm_unmap_remove [10]
[11] 13.3 1.06 0.15 19081 pmap_remove [11]
0.00 0.15 177853/200028 pmap_pvo_remove [43]
0.00 0.00 19081/717628 pmap_interrupts_restore [54]
0.00 0.00 442741/769101 pmap_pvo_find_va [404]
-----------------------------------------------
so on *average*, each call to pmap_remove() is for a range of 20-some pages,
of which 9 are mapped. considering that the text and data for the programs
involved here are 4 and 1 pages respectively, there's some other big mapping
going on that we might be able to shrink or eliminate.
here's the profile from a program that just execs itself a bunch of times
and then exits (on the 604):
% cumulative self self total
time seconds seconds calls ns/call ns/call name
8.31 0.54 0.54 11929 45267.83 45267.83 __syncicache
5.85 0.92 0.38 15059 25234.08 26889.16 pmap_remove
4.62 1.22 0.30 12075 24844.72 24844.72 pmap_zero_page
3.23 1.43 0.21 27243 7708.40 63311.81 uvm_fault
2.77 1.61 0.18 199642 901.61 901.61 lockmgr
2.62 1.78 0.17 84788 2005.00 2005.00 memcpy
2.00 1.91 0.13 __mcount
1.85 2.03 0.12 57641 2081.85 3312.25 pmap_pvo_enter
1.54 2.13 0.10 84785 1179.45 1179.45 pool_get
1.54 2.23 0.10 3003 33300.03 831291.57 sys_execve
1.38 2.32 0.09 234878 383.18 383.18 splx
1.38 2.41 0.09 trapagain
so it looks like getting rid of the need for all that icache syncing
will be a big improvement.
-Chuck
On Sat, Nov 02, 2002 at 07:14:56PM -0500, John Klos wrote:
> Another thing: fork overhead. I ran the following script on 1,260 numbers
> on several machines, and noticed that there is a large discrepency that is
> not explained by processor performance.
>
> #!/bin/sh
> total=0
> for number in `cat test`
> do
> total=`echo $total"+"$number | bc`
> echo $total
> done
>
> (No, I'd never use this in real life - awk is much quicker. Also note
> that the times below are user times, as these systems are heavily used)
>
> Why is it that it seems that forking on PowerPC has so much overhead?
> 68060, 66 MHz: 38.3 seconds, 33 per second, .5 per second per MHz
> 603e, 200 Mhz: 27.24 seconds, 46.25 per second, .23 per second per MHz
> 604ev, 350 MHz: 14.3 seconds, 88 per second, .25 per second per MHz
> G3, 300 MHz: 13.5 seconds, 93.3 per second, .31 per second per MHz
> P3, 550 MHz: 1.93 seconds, 652 per second, 1.19 per second per MHz
>
> Does anyone care to guess why this is? I suppose part of it could be
> explained by memory bus speed - the 68060 is on a 64 bit memory bus, and
> each PowerPC listed has progressively faster memory (and cache), but I
> don't thing that memory alone could explain a fourfold difference in
> efficiency between PowerPC and i386.
>
> Thanks,
> John Klos
> Sixgirls Computing Labs