Subject: Re: 1.5S vs sparc/MP
To: None <tech-smp@netbsd.org>
From: Simon J. Gerraty <sjg@quick.com.au>
List: tech-smp
Date: 02/27/2001 00:30:55
[summarizing progress for those interested]
> Ok, I decided to try my idea - for avoiding the panic seen on MP
> hypersparcs. I added semaphores to the kernel, and
> added appropriate operations to the sparc flush routines.
Ok, so with smp_cache_flush() et al. using a semaphore to ensure the
boot CPU waits for the other(s) to finish flushing, we can get right
up to start_init() with multiple hypersparc CPUs spinning.
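
For readers who want the idea in code: a minimal sketch of that kind of
"boot CPU waits for the flushers" semaphore, using C11 atomics. This is
not the code from the change described above; struct flush_sem and the
flush_sem_*() names are made up for illustration, and the real kernel
version would use the kernel's own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counting semaphore; names are illustrative only. */
struct flush_sem {
    atomic_int pending;   /* CPUs that have not finished flushing yet */
};

/* Boot CPU: arm the semaphore before asking ncpu other CPUs to flush. */
static void flush_sem_arm(struct flush_sem *s, int ncpu)
{
    atomic_store_explicit(&s->pending, ncpu, memory_order_release);
}

/* Secondary CPU: call once its own cache flush is complete. */
static void flush_sem_done(struct flush_sem *s)
{
    int prev = atomic_fetch_sub_explicit(&s->pending, 1,
                                         memory_order_acq_rel);
    assert(prev > 0);     /* more completions than requests is a bug */
}

/* Boot CPU: true once every secondary has reported in. */
static bool flush_sem_idle(struct flush_sem *s)
{
    return atomic_load_explicit(&s->pending, memory_order_acquire) == 0;
}

/* Boot CPU: spin until all outstanding flushes have finished. */
static void flush_sem_wait(struct flush_sem *s)
{
    while (!flush_sem_idle(s))
        ;                 /* busy-wait; a kernel would want a timeout */
}
```

The boot CPU would call flush_sem_arm() with the number of other CPUs,
send the cross-call, and then sit in flush_sem_wait(); each secondary
calls flush_sem_done() when its flush completes.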
It panic'd there because the lockmgr lock that KERNEL_PROC_UNLOCK(p)
wants to release isn't held.
Thanks to Bill Sommerfeld, who came up with:
Index: sparc/locore.s
===================================================================
RCS file: /cvsroot/syssrc/sys/arch/sparc/sparc/locore.s,v
retrieving revision 1.134
diff -u -p -r1.134 locore.s
--- sparc/locore.s 2000/08/31 16:59:12 1.134
+++ sparc/locore.s 2001/02/27 06:57:35
@@ -4870,6 +4870,9 @@ ENTRY(snapshot)
* and when returning a child to user mode after a fork(2).
*/
ENTRY(proc_trampoline)
+#ifdef MULTIPROCESSOR
+ call _C_LABEL(proc_trampoline_mp)
+#endif
call %l0 ! re-use current frame
mov %l1, %o0
which hopefully takes care of that.
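
To illustrate the failure mode: a toy model of "release a lock that
isn't held". The body of proc_trampoline_mp is not shown in this
message; the sketch simply assumes its job is to take the kernel lock
for a newly forked process before that process first runs. struct
toy_lock and the toy_*() functions are hypothetical, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NOBODY (-1)

/* Toy stand-in for the kernel lock: records which pid holds it. */
struct toy_lock {
    int holder;           /* TOY_NOBODY when the lock is free */
};

static void toy_init(struct toy_lock *l)
{
    l->holder = TOY_NOBODY;
}

/* Returns false if somebody already holds the lock. */
static bool toy_acquire(struct toy_lock *l, int pid)
{
    if (l->holder != TOY_NOBODY)
        return false;
    l->holder = pid;
    return true;
}

/* Returns false where a real kernel would panic: caller is not the holder. */
static bool toy_release(struct toy_lock *l, int pid)
{
    if (l->holder != pid)
        return false;
    l->holder = TOY_NOBODY;
    return true;
}
```

A process that reaches its first unlock without ever having taken the
lock hits the toy_release() failure path; taking the lock on the way
through the trampoline (the assumed role of the added call) avoids it.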
Actually, the kernel only gets to start_init() if the semaphore ops do
printfs, which probably forces the CPUs into lockstep on the printf
mutex. If I turn off the printfs we get a watchdog reset much earlier,
e.g.:
qe2 at qec0 slot 2 offset 0x0 rev 1 address 08:00:20:72:58:20
qe3 at qec0 slot 3 offset 0x0 rev 1 address 08:00:20:72:58:20
eccmemctl0 at mainbus0: version 0x0/0x2
scsibus0: waiting 2 seconds for devices to settle...
panic:
Watchdog Reset
Type help for more information
<#1> ok
Note that cpu1 takes the reset. Apparently we fault in ctw_invalid,
and the value of g7 is bogus. (Not my analysis; I have no sparc clue.)
The data is:
<#1> ok 1 .window
            0        1        2        3        4        5        6        7
IN:  ff000000        0 f02b10f0 ffffffff        6    40000 f6248e18 f0053fa4
LOC: 1e8010c3 f0053f60 f0053f64        e        0       38  1c00000        0
OUT: f6248dc8 f019c644 f028f800 f0291ac8        0 f02939e0 f6248d68 f00067e8
<#1> ok .registers
          %g0      %g1      %g2      %g3      %g4      %g5      %g6      %g7
            0        0 f07a4017        0        0 f62431a0 f6243000      37a
           PC      nPC        Y      PSR      WIM      TBR
     f0006124 f00060fc  1c00000 1e001ee2        2 f0004050
<#1> ok 0 .window
            0        1        2        3        4        5        6        7
IN:  f6248dc8 f019c644 f028f800 f0291ac8        0 f02939e0 f6248d68 f00067e8
LOC: f0002000      830 f1f00000 f1800000 1e001dc0 e0002000        1 f02054d9
OUT: f6248dc8        1 f02b0c00 80a76000 f029387c        4 f6248cf8 f019c69c
<#1> ok
(gdb) x/i 0xf0006124
0xf0006124 <ctw_invalid+56>: save %g5, 0x40, %g5
Now, ctw_invalid saves g7+1 windows, so 37a is likely too high ;-)
There's probably some setup of the secondary CPUs that's not being
done yet.
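
A quick sanity check on those numbers, as a small C sketch. It relies
only on SPARC V8 facts (an implementation has at most 32 register
windows, CWP is the low 5 bits of the PSR, and a set bit n in WIM marks
window n invalid); the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPARC_V8_MAX_WINDOWS 32u  /* architectural upper bound on NWINDOWS */

/* ctw_invalid saves g7+1 windows, so g7+1 must not exceed the maximum. */
static bool window_count_plausible(uint32_t g7)
{
    return g7 < SPARC_V8_MAX_WINDOWS;
}

/* Current window pointer: PSR bits 4:0. */
static unsigned psr_cwp(uint32_t psr)
{
    return psr & 0x1fu;
}

/* True if WIM marks window w as invalid. */
static bool wim_invalid(uint32_t wim, unsigned w)
{
    return ((wim >> w) & 1u) != 0;
}
```

With the dump above, window_count_plausible(0x37a) is false (0x37a is
890 decimal), psr_cwp(0x1e001ee2) is 2, and WIM = 2 marks window 1 as
the invalid one.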
--sjg