Subject: Re: mips kernel profiling
To: Jonathan Stone <jonathan@DSG.Stanford.EDU>
From: Simon Burge <simonb@netbsd.org>
List: port-mips
Date: 04/18/2000 13:40:24
Jonathan Stone wrote:

> Wow, apologies for the delay.  Seems like I'm not getting port-mips
> mail these days?!?...
> 
> 
> >_splset is a LEAF function, so it calls MCOUNT.  From what I understand,
> >we shouldn't profile the profiling support :)
> 
> Yep.  Or your stack runneth over. ;).
> 
> 
> When _splset() was introduced, we should've created non-profiled
> entrypoints, say __splhigh() and __splx(), and changed
> mips/include/profile.h to do
> 
>    #define MCOUNT_ENTER s = __splhigh();
>    #define MCOUNT_EXIT  __splx(s);
> 
> 
> (or _splhigh_()/_splx_(), whatever works best with ANSI namespace
> rules).  _KERNEL_MCOUNT_DECL should change to match.
> 
> One way to do this is to use XLEAF() to add alias entrypoints after
> the profiling goop emitted by the LEAF() macros.  That's what the
> locore code used to do with splhigh/_splhigh, once upon a time.

I'm currently running with assembly versions of MCOUNT_{ENTER,EXIT} to
save the function call overhead.  Earlier I had non-profiled _spl*()
routines instead.  Which would you say is better?

Index: profile.h
===================================================================
RCS file: /cvsroot/syssrc/sys/arch/mips/include/profile.h,v
retrieving revision 1.13
diff -p -u -r1.13 profile.h
--- profile.h	2000/03/28 02:58:46	1.13
+++ profile.h	2000/04/18 03:29:46
@@ -42,28 +42,13 @@
 #define _MIPS_PROFILE_H_
 
 #ifdef _KERNEL
- /*
-  *  Declare non-profiled _splhigh() /_splx() entrypoints for _mcount.
-  *  see MCOUNT_ENTER and MCOUNT_EXIT.
-  */
-#define	_KERNEL_MCOUNT_DECL 		\
-	int _splhigh __P((void));	\
-	int _splx __P((int));
-#else   /* !_KERNEL */
-/* Make __mcount static. */
-#define	_KERNEL_MCOUNT_DECL	static
-#endif	/* !_KERNEL */
-
-#ifdef _KERNEL
 # define _PROF_CPLOAD	""
 #else
 # define _PROF_CPLOAD	".cpload $25;"
 #endif
 
-
 #define	_MCOUNT_DECL \
-    _KERNEL_MCOUNT_DECL \
-    void __attribute__((unused)) __mcount
+    static void __attribute__((unused)) __mcount
 
 #define	MCOUNT \
 	__asm__(".globl _mcount;" \
@@ -72,6 +57,7 @@
 	".set noreorder;" \
 	".set noat;" \
 	_PROF_CPLOAD \
+	"subu $29,$29,16;" \
 	"sw $4,8($29);" \
 	"sw $5,12($29);" \
 	"sw $6,16($29);" \
@@ -87,7 +73,7 @@
 	"lw $7,20($29);" \
 	"lw $31,4($29);" \
 	"lw $1,0($29);" \
-	"addu $29,$29,8;" \
+	"addu $29,$29,24;" \
 	"j $31;" \
 	"move $31,$1;" \
 	".set reorder;" \
@@ -95,14 +81,38 @@
 
 #ifdef _KERNEL
 /*
- * The following two macros do splhigh and splx respectively.
- * They have to be defined this way because these are real
- * functions on the MIPS, and we do not want to invoke mcount
- * recursively.
+ * Block interrupts during mcount so that those interrupts can also be
+ * counted (as soon as we get done with the current counting).
  */
-#define	MCOUNT_ENTER	s = _splhigh()
 
-#define	MCOUNT_EXIT	_splx(s)
-#endif /* _KERNEL */
+/* $1 is at, $8 is t0, $12 is MIPS_COP_0_STATUS */
+#define	MCOUNT_ENTER	__asm__( \
+	".set	noat;" \
+	".set	noreorder;" \
+	"mfc0	$1,$12;" \
+	"nop;" \
+	"andi	%0,$1,1;" \
+	"beq	$1,$0,1f;" \
+	"li	$8,-2;" \
+	"and	$1,$1,$8;" \
+	"mtc0	$1,$12;" \
+	"nop;" \
+	"1:;" \
+	".set	at;" \
+	".set	reorder" : "=g" (s) :: "t0", "at");
+
+#define	MCOUNT_EXIT	__asm__( \
+	".set	noat;" \
+	".set	noreorder;" \
+	"beq	%0,$0,1f;" \
+	"mfc0	$1,$12;" \
+	"nop;" \
+	"ori	$1,$1,1;" \
+	"mtc0	$1,$12;" \
+	"nop;" \
+	"1:;" \
+	".set	at;" \
+	".set	reorder" :: "g" (s) : "at");
 
+#endif /* _KERNEL */
 #endif /* _MIPS_PROFILE_H_ */
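
For anyone who doesn't want to read the inline assembly, here is a rough
model in plain C of what the two macros above are meant to do.  It is
only an illustration, not part of the patch: sim_status, SR_INT_ENABLE
and the sim_*() names are made up for this sketch, and a plain variable
stands in for the CP0 status register ($12), assuming bit 0 is the
interrupt-enable bit.

```c
#include <stdint.h>

/* Hypothetical stand-in for the CP0 status register; bit 0 is assumed to be IE. */
#define SR_INT_ENABLE	0x1u

static uint32_t sim_status;

/*
 * Models the intent of MCOUNT_ENTER: remember the old interrupt-enable
 * bit, then clear it if it was set.
 */
static uint32_t
sim_mcount_enter(void)
{
	uint32_t s = sim_status & SR_INT_ENABLE;	/* andi %0,$1,1 */

	if (s != 0)
		sim_status &= ~SR_INT_ENABLE;	/* and with -2, then mtc0 */
	return s;
}

/*
 * Models the intent of MCOUNT_EXIT: turn interrupts back on only if
 * they were on when we entered.
 */
static void
sim_mcount_exit(uint32_t s)
{
	if (s != 0)
		sim_status |= SR_INT_ENABLE;	/* ori with 1, then mtc0 */
}
```

The point of saving the old bit in s is that mcount can be entered with
interrupts already blocked, and in that case the exit path must leave
them blocked.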

> *Sigh*.  It's a real shame kernel profiling keeps getting busted.  That
> suggests that kernel changes aren't being adequately profiled
> before they get committed.  NetBSD/pmax used to be enough faster than
> the alternatives that some large campuses switched servers just for
> the performance improvement.  I wonder if that's still true.
> 
> Simon -- can you run lmbench binaries on both Ultrix and NetBSD,
> on a 60MHz R4400?

Overall not too bad.  The process exec time is probably the worst for
NetBSD.  In this case, the Ultrix box had no local filesystems, so
pretty much ignore the file benchmarks.

                 L M B E N C H  1 . 9   S U M M A R Y
                 ------------------------------------
                 (Alpha software, do not distribute)

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host                 OS  Mhz null null      open selct sig  sig  fork exec sh  
                             call  I/O stat clos       inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
mips-dec-    ULTRIX 4.5  117  3.7  28.   80   99 0.39K 13.8   41 5.6K  14K  30K
pmax-netb   NetBSD 1.4X  118  3.5  17.  105  124 0.31K  8.2   26 5.0K  37K  62K

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
mips-dec-    ULTRIX 4.5   46    356    963   251   1493     296    1738
pmax-netb   NetBSD 1.4X   18    339    754   284   1198     308    1714

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
mips-dec-    ULTRIX 4.5    46   101  146   404         302       1678
pmax-netb   NetBSD 1.4X    18   131  123   383         458       1882

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host                 OS   0K File      10K File      Mmap    Prot    Page       
                        Create Delete Create Delete  Latency Fault   Fault 
--------- ------------- ------ ------ ------ ------  ------- -----   ----- 
mips-dec-    ULTRIX 4.5    189     61   1265    128        0              
pmax-netb   NetBSD 1.4X   2941   1136   5555   3030   158314          6.6K

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
mips-dec-    ULTRIX 4.5   16   10   -1      1      0     10      9   23    18
pmax-netb   NetBSD 1.4X   10   12    7      9     24     10     10   24    18

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
---------------------------------------------------
Host                 OS   Mhz  L1 $   L2 $    Main mem    Guesses
--------- -------------   ---  ----   ----    --------    -------
mips-dec-    ULTRIX 4.5   117    23    281        1269
pmax-netb   NetBSD 1.4X   118    25    291        1251

Simon.