Subject: Re: 1.5S vs sparc/MP
To: None <tech-smp@netbsd.org>
From: Simon J. Gerraty <sjg@quick.com.au>
List: tech-smp
Date: 02/27/2001 00:30:55
[summarizing progress for those interested]

> Ok, I decided to try my idea - for avoiding the panic seen on MP
> hypersparcs.   I added semaphores to the kernel, and
> added appropriate operations to the sparc flush routines.

Ok, so with smp_cache_flush() et al. using a semaphore to ensure the
boot CPU waits for the other(s) to finish flushing, we can get right
up to start_init() with multiple hypersparc CPUs spinning.
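
For readers who want the shape of that idea, here is a minimal sketch of
a counting semaphore the boot CPU could spin on.  It is not the code
from my tree: the flush_sem names are hypothetical, it uses C11 atomics
rather than anything sparc-specific, and the walk-through is
single-threaded purely so the ordering is easy to follow.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counting semaphore; not the code from the actual change. */
struct flush_sem {
    atomic_int pending;   /* secondary CPUs that have not finished flushing */
};

/* Boot CPU: arm the semaphore before asking the others to flush. */
static void flush_sem_arm(struct flush_sem *s, int nsecondary)
{
    atomic_store_explicit(&s->pending, nsecondary, memory_order_release);
}

/* Secondary CPU: call once its own cache flush is complete. */
static void flush_sem_post(struct flush_sem *s)
{
    atomic_fetch_sub_explicit(&s->pending, 1, memory_order_release);
}

/* Number of CPUs still flushing. */
static int flush_sem_pending(struct flush_sem *s)
{
    return atomic_load_explicit(&s->pending, memory_order_acquire);
}

/* Boot CPU: spin until every secondary CPU has posted. */
static void flush_sem_wait(struct flush_sem *s)
{
    while (flush_sem_pending(s) > 0)
        ;   /* busy-wait */
}

/* Single-threaded walk-through: two secondaries posting one by one. */
static int flush_demo(void)
{
    struct flush_sem s;
    int after_first;

    flush_sem_arm(&s, 2);
    assert(flush_sem_pending(&s) == 2);
    flush_sem_post(&s);                  /* first secondary done */
    after_first = flush_sem_pending(&s); /* one still outstanding */
    flush_sem_post(&s);                  /* second secondary done */
    flush_sem_wait(&s);                  /* returns at once: nothing pending */
    return after_first;
}
```

In a real kernel the post would happen on the secondary CPUs themselves
and the boot CPU would sit in flush_sem_wait() until they all check in.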

It panicked there because the lockmgr lock that KERNEL_PROC_UNLOCK(p)
wants to release isn't held.

Thanks to Bill Sommerfeld, who came up with:

Index: sparc/locore.s
===================================================================
RCS file: /cvsroot/syssrc/sys/arch/sparc/sparc/locore.s,v
retrieving revision 1.134
diff -u -p -r1.134 locore.s
--- sparc/locore.s      2000/08/31 16:59:12     1.134
+++ sparc/locore.s      2001/02/27 06:57:35
@@ -4870,6 +4870,9 @@ ENTRY(snapshot)
  * and when returning a child to user mode after a fork(2).
  */
 ENTRY(proc_trampoline)
+#ifdef MULTIPROCESSOR
+       call _C_LABEL(proc_trampoline_mp)
+#endif
        call    %l0                     ! re-use current frame
         mov    %l1, %o0
 
which hopefully takes care of that.
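
A toy model of the failure and of what the hook is presumably for.
Everything named toy_* is made up for illustration, and the assumption
that proc_trampoline_mp takes the kernel lock on behalf of the new
process is mine; the patch above only shows the call, not its body.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel lock; all names here are hypothetical. */
struct toy_lock {
    bool held;
    int  panics;   /* counts "release of unheld lock" events */
};

static void toy_lock_acquire(struct toy_lock *l)
{
    l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
    if (!l->held) {
        l->panics++;   /* the real kernel would panic here */
        return;
    }
    l->held = false;
}

/* First code run by a new process; with_mp_hook models the added call. */
static int toy_trampoline(bool with_mp_hook)
{
    struct toy_lock kernel_lock = { false, 0 };

    if (with_mp_hook)
        toy_lock_acquire(&kernel_lock);  /* assumed role of proc_trampoline_mp */

    /* ... the process body (start_init in this report) runs here ... */

    toy_lock_release(&kernel_lock);      /* models KERNEL_PROC_UNLOCK(p) */
    return kernel_lock.panics;
}
```

Without the hook the release finds the lock unheld, which is the panic
described above; with it the acquire and release pair up.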

Actually the kernel only gets to start_init() if the semaphore ops do
printfs, which probably forces the CPUs into lockstep on the printf
mutex.  If I turn off the printfs we get a watchdog reset much earlier,
e.g.:

qe2 at qec0 slot 2 offset 0x0 rev 1 address 08:00:20:72:58:20
qe3 at qec0 slot 3 offset 0x0 rev 1 address 08:00:20:72:58:20
eccmemctl0 at mainbus0: version 0x0/0x2
scsibus0: waiting 2 seconds for devices to settle...
panic:
Watchdog Reset
Type  help  for more information
<#1> ok 

Note that cpu1 takes the reset.  Apparently we fault in ctw_invalid,
and the value of g7 is bogus.  Not my analysis - no sparc clue.
The data is:

<#1> ok 1 .window
            0        1        2        3        4        5        6        7
IN:  ff000000        0 f02b10f0 ffffffff        6    40000 f6248e18 f0053fa4
LOC: 1e8010c3 f0053f60 f0053f64        e        0       38  1c00000        0
OUT: f6248dc8 f019c644 f028f800 f0291ac8        0 f02939e0 f6248d68 f00067e8
<#1> ok .registers
          %g0      %g1      %g2      %g3      %g4      %g5      %g6      %g7
            0        0 f07a4017        0        0 f62431a0 f6243000      37a
           PC      nPC        Y      PSR      WIM      TBR
     f0006124 f00060fc  1c00000 1e001ee2        2 f0004050
<#1> ok 0 .window
            0        1        2        3        4        5        6        7
IN:  f6248dc8 f019c644 f028f800 f0291ac8        0 f02939e0 f6248d68 f00067e8
LOC: f0002000      830 f1f00000 f1800000 1e001dc0 e0002000        1 f02054d9
OUT: f6248dc8        1 f02b0c00 80a76000 f029387c        4 f6248cf8 f019c69c
<#1> ok 

(gdb) x/i 0xf0006124
0xf0006124 <ctw_invalid+56>:    save  %g5, 0x40, %g5

Now ctw_invalid saves g7+1 windows, and a g7 of 0x37a (890 decimal) is
likely far too high ;-)
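
The arithmetic, as a small sketch.  The only outside fact used is that
SPARC V8 allows at most 32 register windows; the helper names are
hypothetical and the "g7+1" rule is simply taken from the analysis
above.

```c
#include <assert.h>
#include <stdint.h>

#define SPARC_V8_MAX_WINDOWS 32u   /* architectural upper bound on windows */

/* Windows ctw_invalid would save, per the analysis above: g7 + 1. */
static uint32_t windows_to_save(uint32_t g7)
{
    return g7 + 1u;
}

/* A save count larger than the window file itself cannot be right. */
static int window_count_plausible(uint32_t g7)
{
    return windows_to_save(g7) <= SPARC_V8_MAX_WINDOWS;
}
```

Feeding in the %g7 from the dump, windows_to_save(0x37a) is 891, so
window_count_plausible(0x37a) is false.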

There's probably some setup of the secondary CPUs that's not being
done yet.

--sjg