port-macppc: ppc benchmarks, quick and dirty. 604ev, g3, mips

Subject: ppc benchmarks, quick and dirty. 604ev, g3, mips
To: netbsd-macppc <port-macppc@NetBSD.ORG>
From: Riccardo Mottola <rollei@tiscalinet.it>
List: port-macppc
Date: 03/12/2005 15:43:10
Hey all..

there has been some talk here lately about the 604ev and its "mighty power"
in FPU and memory bandwidth, the G3 being just a pepped-up 603, etc. etc.


I ran some quick benchmarks with some FFT code I have. I did just one or
two runs for each executable, and there are compiler differences, so I
report all the results. Still, the G3 and 604ev comparison should be
reasonable if you take the executables that were both built with gcc 3
(and notice how the times differ...)

Summing up:
fastest G3 running time: about 36.5 seconds, with no noticeable difference
from the extra G3-specific options.
fastest 604ev running time: 81.3 seconds, with the gcc 3 binary, and no
noticeable difference among the 604-specific options!
fastest 601 running time: 234 seconds. This is with an older compiler, at
120 MHz, and with a smaller cache on the mainboard. A quick linear scale-up
to 350 MHz (a semi-reasonable assumption for this kind of test) gives
80.2 seconds.
The original RS/6000, with a mighty 20 MHz, still turns in a respectable
500 seconds or so! I had no time to run the other options on it, though.
Could the OS play such an important role? I doubt it, but it could explain
some of the oddities.
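The 601 projection above is a plain linear clock scaling. Here is a minimal
Python sketch of that arithmetic. It assumes run time is inversely
proportional to clock speed, which ignores bus speed, cache size and memory
latency (the very caveats mentioned). The function name is illustrative only.

```python
def scale_time(seconds: float, from_mhz: float, to_mhz: float) -> float:
    """Estimate the run time at another clock speed, assuming time is
    inversely proportional to frequency (cache, bus and memory ignored)."""
    return seconds * from_mhz / to_mhz

# Best 601 time (234 s at 120 MHz) projected to the 350 MHz of the G3/604ev
projected = scale_time(234, 120, 350)
print(f"601 projected to 350 MHz: {projected:.1f} s")
```

The same call can project any other machine in the tables to a common
clock. Just keep in mind that the linear assumption gets weaker the further
apart the clocks and memory systems are.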

This confirms my belief that the 601 is actually a very good processor.
Many people have called it "slow", but it is in fact OK! When I had a 604
at 120 MHz, I remember really close times on many tasks, with the same
compiler and clock.

But you can see that the G3 appears much faster even in this almost purely
FPU-bound processing! I think the assumption that the G3 has a bad FPU
doesn't come from hard data.

I get about the same results when running "setiathome" and checking
average run times, but here I have control over the compiler options, and
the data set is probably much smaller, so this is more of a
"microbenchmark".


--------------------------------
FFT running times:

4096 samples; the data set is 2+4 arrays of 4096 doubles each,
that is 6 * sizeof(double) * 4096 = 6 * 8 * 4096 = 196608 bytes (192 KB),
assuming 8-byte doubles.
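To back up the claim (made in the comments) that the data set should fit in
the cache, the size can be checked against the cache sizes listed for each
machine. This Python sketch counts only the six data arrays. Code, stack and
any lookup tables are ignored, and 8-byte doubles are assumed as stated above.
The RS/6000 is left out because its cache size is not given.

```python
SIZEOF_DOUBLE = 8   # assumption: 8-byte IEEE doubles, as stated in the text
N_SAMPLES = 4096
N_ARRAYS = 2 + 4

dataset_bytes = N_ARRAYS * SIZEOF_DOUBLE * N_SAMPLES
print(dataset_bytes, "bytes =", dataset_bytes // 1024, "KB")

# Cache sizes (KB) as listed for each machine in the tables
caches_kb = {
    "B/W G3": 1024,
    "9600 (604ev)": 1024,
    "8200 (601)": 256,
    "Indigo2 R4000": 1024,
    "Indy R5000": 512,
    "Indigo2 R10000": 1024,
}

for machine, kb in caches_kb.items():
    fits = dataset_bytes <= kb * 1024
    print(f"{machine}: {kb} KB cache, data arrays fit: {fits}")
```

Even the 601's 256 KB board cache holds the 192 KB of arrays, although with
little room to spare.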

Comments:

The tests are sub-optimal (I ran them only once, or twice if some value
seemed unreasonable). gcc 2.95 optimizations for PPC are dubious, and
nonexistent for the R5000.

I don't understand why the 604ev seems so slow. Everybody expected it to
be faster than the G3! The FFT computation should be more CPU-bound than
memory-bound, and the data set should fit in the cache. Could the OS be
doing something wrong (the 604 is not certified for Mac OS X 10.1.5), like
not enabling the L2 cache, or enabling it "badly"?

Why does MIPSpro code slow down so much when -IPA is used? And -rX also
slows it down compared with -mipsN alone!

1 - cc, no options
2 - -O1
3 - -O2
4 - -O3
5 - -O1 plus -mcpu and -mtune for the specific CPU (or the equivalent,
most specific optimization)
6 - -O2, CPU-specific
7 - -O3, CPU-specific
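For the gcc builds, the seven configurations above map onto flag lists in a
regular way. The Python helper below expands them for a given CPU name. The
helper and the source file name "fft.c" are hypothetical placeholders, not
the original build setup. MIPSpro uses its own flags (-mipsN, -rX, -IPA), so
this covers gcc only.

```python
def gcc_configs(cpu: str) -> dict[int, list[str]]:
    """Return gcc flags for the seven test configurations.
    `cpu` is the value passed to -mcpu/-mtune, e.g. "750" for the G3."""
    cpu_flags = [f"-mcpu={cpu}", f"-mtune={cpu}"]
    return {
        1: [],                    # no options
        2: ["-O1"],
        3: ["-O2"],
        4: ["-O3"],
        5: ["-O1"] + cpu_flags,   # CPU-specific variants
        6: ["-O2"] + cpu_flags,
        7: ["-O3"] + cpu_flags,
    }

# Print one compile command per configuration (file names are placeholders)
for n, flags in gcc_configs("750").items():
    print(" ".join(["cc", *flags, "-o", f"fft{n}", "fft.c"]))
```

Generating every binary from one table like this keeps the per-CPU runs
comparable, since only the -mcpu/-mtune value changes between machines.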

B/W G3: G3, 350 MHz, 1 MB cache, 100 MHz bus. Running 10.2.8.
cc based on gcc 3.1.
CPU target assumed: "750"
(1) 69.84
(2) 41.69
(3) 38.94
(4) 36.58
(5) 38.61
(6) 39.49
(7) 36.61
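To compare the configurations, it can help to express each time as a
speedup over the unoptimized build (1). Here is a small Python sketch using
the G3 numbers above. The dictionary layout is just one convenient choice.

```python
g3_times = {1: 69.84, 2: 41.69, 3: 38.94, 4: 36.58,
            5: 38.61, 6: 39.49, 7: 36.61}

baseline = g3_times[1]                    # config 1: no optimization
best = min(g3_times, key=g3_times.get)    # configuration with the lowest time

for n, t in g3_times.items():
    print(f"({n}) {t:6.2f} s  speedup {baseline / t:.2f}x")
print(f"fastest: ({best}) {g3_times[best]} s")
```

The same approach works for the other tables. For rows with A/B variants,
key the dictionary on the row label (e.g. "4-a") instead of the number.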


9600: 604ev, 350 MHz, 1 MB cache. Running 10.1.5.
A times are executables built with the G3's gcc 3.1; B times are from the
gcc 2.95-based compiler.
CPU target assumed: "604"
For gcc 2.95 I found no PPC-specific optimization docs, so I used -mtune
and -mcpu, but that is possibly wrong.
(1-a) 166
(1-b) 114.85
(2-a) 83.72
(2-b) 82.03
(3-a) 73.46
(3-b) 105
(4-a) 81.30
(4-b) 105
(5-a) 81.44
(5-b) 109
(6-a) 95.5
(6-b) 88.5
(7-a) 81.39
(7-b) 110

8200: 601, 120 MHz, 256 KB cache, MkLinux R2RC2.
The compiler is gcc 2.96 (see the CPU optimization note for the 9600; I
assumed "601" in this case).
(1) 332
(2) 252.9
(3) 256
(4) 266
(5) 269
(6) 267
(7) 234

RS/6000 320: POWER processor, 20 MHz, AIX 4.2.
The compiler is gcc 2.95 with -mcpu=rs6000 -mtune=rios1.
(1) 1571
(2)
(3)
(4)
(5)
(6)
(7) 499

Indigo2: MIPS R4000, 100 MHz, 1 MB cache, IRIX 6.5.
A times are gcc 2.95 (-mcpu=r4000 -mips3 for the CPU-specific optimization).
B times refer to MIPSpro (-mips3 was always specified).
The CPU-specific optimizations for tests 5 to 7 used -IPA and -r4000.
(1-a) 665
(2-a) 210.56
(3-a) 215
(4-a) 212
(5-a) 264
(6-a) 253
(7-a) 250
(1-b) 530
(2-b) 437
(3-b) 193
(4-b) 155
(7-b) 155
(7-b) without -IPA 159


Indy: R5000, 180 MHz, 512 KB cache, IRIX 6.5.
A times are gcc 2.95 (when compiled with gcc 2.96 the code was optimized
for the R4000, since no mips4 option was present).
B times refer to MIPSpro; see the Indigo2 notes for explanations
(-mips4 was always specified). For tests 5 to 7 I then added -IPA and
-r5000, and tried removing -IPA again in a second run.
(1-a) 259.59
(2-a) 155.36
(3-a) 183.9
(4-a) 155.24
(5-a) 161.5
(6-a) 154.21
(7-a) 154.11
(1-b) 228
(2-b) 210.81
(3-b) 166.81
(4-b) 108.8
(6-b) without -IPA 233
(7-b) 141.75
(7-b) without -IPA 118

Indigo2: MIPS R10000, 195 MHz, 1 MB cache, IRIX 6.5.
A times are gcc 3.4.0; the only CPU-specific option is -mips4.
B times refer to MIPSpro; as above, -r10000 was specified for tests 5 to 7,
and -mips4 for all of them.
(1-b) 117.7
(2-b) 119
(3-b) 49
(4-b) 34
(5-a) 54
(5-b) 109
(6-a) 49
(6-b) without -IPA 47
(7-a) 47
(7-b) 35
(7-b) without -IPA 32
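The -IPA question raised in the comments can be put into numbers using the
MIPSpro config (7) rows. These are the only rows measured both with and
without -IPA on all three MIPS machines. This Python sketch reports the
change in run time caused by -IPA; a negative value means -IPA made the
code faster.

```python
# MIPSpro config (7) times in seconds: (with -IPA, without -IPA)
ipa_pairs = {
    "Indigo2 R4000": (155.0, 159.0),
    "Indy R5000": (141.75, 118.0),
    "Indigo2 R10000": (35.0, 32.0),
}

def ipa_change_percent(with_ipa: float, without_ipa: float) -> float:
    """Percent change in run time when -IPA is added (positive = slower)."""
    return (with_ipa - without_ipa) / without_ipa * 100

for machine, (with_ipa, without_ipa) in ipa_pairs.items():
    print(f"{machine}: -IPA changes run time by "
          f"{ipa_change_percent(with_ipa, without_ipa):+.1f}%")
```

With single runs per binary, differences of a few percent (such as on the
R4000) may well be noise. The R5000 gap is large enough to stand out.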