Subject: Re: improving ssh performance on sun4m systems
To: None <port-sparc@netbsd.org, uwe@ptc.spbu.ru>
From: None <eeh@netbsd.org>
List: port-sparc
Date: 03/15/2002 23:38:01
| That's the point you miss.  Functions that you write yourself that are
| not in the library anywhere (and those that are too) *do call* ".umul"
| &co from libc.so for multiplication &co.
|
|     volatile int a = 4, b = 3;	/* force gcc to perform the multiplication */
|     main() { return (a*b); }	/* not in the library */
|
|
| compiles to:
|
| main:
|         save %sp,-104,%sp
|         sethi %hi(a),%o1
|         ld [%o1+%lo(a)],%o0
|         sethi %hi(b),%o2
|         ld [%o2+%lo(b)],%o1
|         call .umul,0		! <-- .umul comes from libc.so
|          nop
|         ret
|          restore %g0,%o0,%o0

Ewww.  That's icky code.  You have load/use penalties between each
sethi and load, which means they cannot be issued in parallel on a
superscalar processor.  And the delay slot has not been filled.
Here's the same code generated by "cc -mv8 -O3":

main:
        sethi   %hi(a), %g2
        ld      [%g2+%lo(a)], %o1
        sethi   %hi(b), %g3
        ld      [%g3+%lo(b)], %o0
        retl
        smul    %o1, %o0, %o0

Same load/use penalties, but this is a leaf function so there's no
save, and a multiply instruction is used instead of a function call.
Here are the results with "cc -mtune=supersparc -O3":

main:
        save    %sp, -104, %sp
        sethi   %hi(a), %o1
        sethi   %hi(b), %o2
        ld      [%o1+%lo(a)], %o0
        ld      [%o2+%lo(b)], %o1
        call    .umul, 0
         nop
        ret
        restore %g0, %o0, %o0

Here we have the sethi instructions together.  They can be issued
in parallel, as could the loads.  We're still calling the .umul
function, and the delay slot is not filled.  This is code that will
execute on v7 CPUs and also make use of multiple functional
units on v8 and v9 CPUs.  Here's the v9 code for comparison:

main:
        sethi   %hi(a), %g2
        sethi   %hi(b), %g3
        add     %g2, %g4, %g2
        add     %g3, %g4, %g3
        ld      [%g2+%lo(a)], %o0
        ld      [%g3+%lo(b)], %o1
        mulx    %o0, %o1, %o0
        retl
        sra     %o0, 0, %o0

(Ick.  We're still using the embedany memory model.  We need to change
that.)  Here's the code compiled with "cc -mtune=ultrasparc -O3":

main:
        save    %sp, -104, %sp
        sethi   %hi(a), %o1
        sethi   %hi(b), %o2
        ld      [%o1+%lo(a)], %o0
        ld      [%o2+%lo(b)], %o1
        call    .umul, 0
         nop
        ret
        restore %g0, %o0, %o0

So what have we learned?  Well, we do get some performance boost
from using the multiply instruction instead of calling the library routine.
But other parts of the code are poorly tuned.  Also, tuning is separate
from the choice of instructions.  So we should be able to improve performance
without losing v7 compatibility by using "-mtune=supersparc" (or
"-mtune=ultrasparc", since the UltraSPARC has more functional units and
should provide more parallelism).
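
For anyone wondering why the library call costs more than the
instruction, here's a rough sketch in C of the kind of shift-and-add
loop a software multiply has to run when the CPU has no multiply
instruction.  This is only an illustration, not the actual .umul from
libc, and the name soft_umul is made up for the example.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for a software multiply routine such as .umul.
 * NOT the real libc implementation; it only shows the general technique.
 * Returns the low 32 bits of a * b using shifts and adds.
 */
uint32_t soft_umul(uint32_t a, uint32_t b)
{
    uint32_t product = 0;

    while (b != 0) {
        if (b & 1u)
            product += a;   /* add the shifted multiplicand for each set bit of b */
        a <<= 1;            /* next bit of b is worth twice as much */
        b >>= 1;
    }
    return product;
}
```

With the values from the test program, soft_umul(4, 3) goes around the
loop twice and does two adds, where the v8 code needs a single smul.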

Eduardo