Subject: port-sparc/20962: Recently updated ss20/hs sparc/mp stops with "Watchdog Reset"
To: None <gnats-bugs@gnats.netbsd.org>
From: None <he@netbsd.org>
List: netbsd-bugs
Date: 03/30/2003 22:08:58
>Number:         20962
>Category:       port-sparc
>Synopsis:       Recently updated ss20/hs sparc/mp stops with "Watchdog Reset"
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-sparc-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Mar 30 12:10:01 PST 2003
>Closed-Date:
>Last-Modified:
>Originator:     Havard Eidnes
>Release:        NetBSD 1.6Q Mar 29 2003
>Organization:
	Unorganized, Inc.
>Environment:
System: NetBSD grizzly.urc.uninett.no 1.6Q NetBSD 1.6Q (GENERIC.MP) #4: Sun Mar 30 17:04:21 CEST 2003  he@grizzly.urc.uninett.no:/sys/arch/sparc/compile/GENERIC.MP sparc
Architecture: sparc
Machine: sparc
>Description:
	Previously, my sparc occasionally printed the odd

	xcall(cpu1,0xf02656e0): couldn't ping cpus:xcall(cpu1,0xf02656e0): couldn't ping cpus:

	messages (i.e. the message is never followed by a newline),
	but the machine has been keeping itself up and running despite
	these messages.

	Recently it appears that the frequency of these messages has
	increased (?), and the machine now exits to the PROM with
	"Watchdog Reset".  The watchdog reset *may* be
	coincidental (?).

	So, to get some more information about which CPU the xcall
	is trying to reach, I modified the code to print the "cpuset"
	variable like so:

Index: cpu.c
===================================================================
RCS file: /cvsroot/src/sys/arch/sparc/sparc/cpu.c,v
retrieving revision 1.174
diff -u -r1.174 cpu.c
--- cpu.c       2003/02/26 17:39:07     1.174
+++ cpu.c       2003/03/30 19:52:10
@@ -722,8 +722,8 @@
        i = 10000;      /* time-out, not too long, but still an _AGE_ */
        while (!done) {
                if (--i < 0) {
-                       printf_nolog("xcall(cpu%d,%p): couldn't ping cpus:",
-                           cpu_number(), func);
+                       printf_nolog("xcall(cpu%d,%p): couldn't ping cpus, cpuset=%x\n",
+                           cpu_number(), func, cpuset);
                }
 
                done = 1;
@@ -735,7 +735,7 @@
 
                        if (cpi->msg.complete == 0) {
                                if (i < 0) {
-                                       printf_nolog(" cpu%d", cpi->ci_cpuid);
+                                       printf_nolog("xcall failed to cpu%d\n", cpi->ci_cpuid);
                                } else {
                                        done = 0;
                                        break;

	and I now get:

NetBSD/sparc (grizzly.urc.uninett.no) (console)

login: xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
xcall(cpu1,0xf02656e4): couldn't ping cpus, cpuset=1
Asy0
Watchdog Reset
Type  help  for more information
<#2> ok nmi_hard: SMP botch.cpu0: NMI: system interrupts: 80000<VME=0,SBUS=0,T>
Level 15 Interrupt
<#2> ok

	So, it's trying to xcall the other CPU (cpu0, bit 0 set in
	cpuset), and apparently failing.
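
	For reference, reading "cpuset" as a bit mask can be sketched as
	below.  This is only an illustration under the assumption made
	above (bit n set means cpu n is a target); cpuset_decode() is a
	hypothetical helper and is not part of cpu.c.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of cpu.c: collect the CPU numbers
 * whose bits are set in a cpuset-style mask.
 * Assumption: bit n set in the mask means cpu n is a target.
 * Returns the number of entries written to cpus[].
 */
static int
cpuset_decode(unsigned int cpuset, int *cpus, int maxcpus)
{
	int n, count = 0;

	assert(maxcpus >= 0);
	for (n = 0; n < 32 && count < maxcpus; n++) {
		if (cpuset & (1U << n))
			cpus[count++] = n;
	}
	return count;
}
```

	With the cpuset=1 seen on the console, this yields the single
	entry 0, i.e. cpu0.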

	A minimal set of debugging info from the PROM monitor:

<#2> ok ctrace
PC: f02658fc 
Last leaf: call 10042cfc    from 10042a80 
     0 w  %o0-%o5: (        0 ffffffe0       27 f0002000      200       20 )

jmpl  f0265884    from f0223e28 
     1 w  %o0-%o5: (        a        0       27        0 f5d02000 20000000 )

call f0223cac    from f0265d44 
     2 w  %o0-%o5: ( f0265884        a        4        1        1        1 )

jmpl  f0265d28    from f02651c8 
     3 w  %o0-%o5: (        a        4       27 f0265d28    3f000        3 )

call f0265190    from f026639c 
     4 w  %o0-%o5: (        a        4        2 f682d000        2 f044f6d8 )

call f0266228    from f0276ce8 
     5 w  %o0-%o5: (        0 f682efb0        0     45ec ffffffff      805 )

call f0276a1c    from f0008694 
     6 w  %o0-%o5: (       25 1e000082 10042d70 f682efb0        0    3e000 )

XXXXXXX    from 54ff4   
     7 w  %o0-%o5: (  3040873 81c06174  3000000 81c06000    10ea4       91 )

call 13f70    from 13a8c 
     8 w  %o0-%o5: (    3f1a8        2    13a88    3ec00    3e000        0 )

call 13970    from 13a48 
     9 w  %o0-%o5: (    3f1a8        2    13a10    3f000    42040       27 )

call 13970    from 1392c 
     a w  %o0-%o5: (    43214        0      1f8    3e000    3f000        4 )

call 13908    from 1e564 
     b w  %o0-%o5: (    3f000        0        0    3e000    3f000        3 )

call 1e224    from 11adc 
     c w  %o0-%o5: (    3e090 effff29c effff2ac       10        1 10049270 )

call 11a18    from 11a08 
     d w  %o0-%o5: (        3 effff29c    3e000 10043a9c 1005e000 effffff0 )

<#2> ok 0 .window
            0        1        2        3        4        5        6        7
IN:         a        0       27        0 f5d02000 20000000 f682ece8 f0223e28
LOC:        0    13be4    13934        4        0      300 f682d000 1023f380
OUT:        0 ffffffe0       27 f0002000      200       20 f682ec80 10042a80
<#2> ok .registers
          %g0      %g1      %g2      %g3      %g4      %g5      %g6      %g7
            0  8000000        2 f07eff90    3e800 ffffffff f682d000 f02662b8
           PC      nPC        Y      PSR      WIM      TBR
     f02658fc f0265900     8000 1e500be4        8 f0006090
<#2> ok 

	The kernel part of the stack backtrace appears to be:

(gdb) x/i 0xf02658fc
0xf02658fc <srmmu_cache_flush+120>:     sta  %o0, [ %l0 ] #ASI_AIUP
(gdb) x/i 0xf0265884
0xf0265884 <srmmu_cache_flush>: save  %sp, -104, %sp
(gdb) x/i 0xf0223cac
0xf0223cac <xcall>:     save  %sp, -112, %sp
(gdb) x/i 0xf0265d28
0xf0265d28 <smp_cache_flush>:   save  %sp, -104, %sp
(gdb) x/i 0xf0265190
0xf0265190 <cache_flush>:       save  %sp, -104, %sp
(gdb) x/i 0xf0266228
0xf0266228 <emulinstr>: save  %sp, -128, %sp
(gdb) x/i 0xf0276a1c
0xf0276a1c <trap>:      save  %sp, -168, %sp
(gdb) 

	The dmesg output from my machine is:

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 1.6Q (GENERIC.MP) #4: Sun Mar 30 17:04:21 CEST 2003
        he@grizzly.urc.uninett.no:/sys/arch/sparc/compile/GENERIC.MP
total memory = 255 MB
avail memory = 233 MB
using 896 buffers containing 13184 KB of memory
bootpath: /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@1,0
mainbus0 (root): SUNW,SPARCstation-20: hostid 72797978
cpu0 at mainbus0: mid 8: RT620/625 @ 150 MHz, on-chip FPU
cpu0: 512K byte write-back, 32 bytes/line, sw flush: cache enabled
cpu1 at mainbus0: mid 10: RT620/625 @ 150 MHz, on-chip FPU
cpu1: 512K byte write-back, 32 bytes/line, sw flush: cache enabled
obio0 at mainbus0
clock0 at obio0 slot 0 offset 0x200000: mk48t08
timer0 at obio0 slot 0 offset 0x300000: delay constant 73
zs0 at obio0 slot 0 offset 0x100000 level 12 softpri 6
zstty0 at zs0 channel 0 (console i/o)
zstty1 at zs0 channel 1
zs1 at obio0 slot 0 offset 0x0 level 12 softpri 6
kbd0 at zs1 channel 0: baud rate 1200
ms0 at zs1 channel 1: baud rate 1200
fdc0 at obio0 slot 0 offset 0x700000 level 11 softpri 4: chip 82077
fd0 at fdc0 drive 0: 1.44MB 80 cyl, 2 head, 18 sec
auxreg0 at obio0 slot 0 offset 0x800000
power0 at obio0 slot 0 offset 0xa01000 level 2
iommu0 at mainbus0 ioaddr 0xe0000000: version 0x3/0x1, page-size 4096, range 64MB
sbus0 at iommu0: clock = 25 MHz
dma0 at sbus0 slot 15 offset 0x400000: dma rev 2
esp0 at dma0 slot 15 offset 0x800000 level 4: ESP200, 40MHz, SCSI ID 7
scsibus0 at esp0: 8 targets, 8 luns per target
ledma0 at sbus0 slot 15 offset 0x400010: dma rev 2
le0 at ledma0 slot 15 offset 0xc00000 level 6: address 08:00:20:79:79:78
le0: 8 receive buffers, 2 transmit buffers
bpp0 at sbus0 slot 15 offset 0x4800000 level 2 (ipl 3): dma rev 2
SUNW,DBRIe at sbus0 slot 14 offset 0x10000 level 9 not configured
cgsix0 at sbus0 slot 2 offset 0x0 level 9: SUNW,501-2325, 1152 x 900, rev 11
cgsix0: attached to /dev/fb
eccmemctl0 at mainbus0 ioaddr 0x0: version 0x0/0x2
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 1 lun 0: <QUANTUM, QM39100TD-SCA, N1K0> disk fixed
sd0: 8683 MB, 8057 cyl, 10 head, 220 sec, 512 bytes/sect x 17783249 sectors
sd0: sync (100.0ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
sd1 at scsibus0 target 3 lun 0: <QUANTUM, QM39100TD-SCA, N1K0> disk fixed
sd1: 8683 MB, 8057 cyl, 10 head, 220 sec, 512 bytes/sect x 17783249 sectors
sd1: sync (100.0ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
cd0 at scsibus0 target 6 lun 0: <TOSHIBA, XM-4101TASUNSLCD, 1755> cdrom removable
cd0: async, 8-bit transfers
root on sd0a dumps on sd0b
root file system type: ffs
cpu0: booting secondary processors: cpu1


>How-To-Repeat:
	All that appears to be required is to do a normal build to
	trigger the problem.

	Code inspection does, however, reveal that the newline which
	is supposed to follow the while loop in:

        if (!done)
                printf_nolog("\n");

	will never be printed: the "done = 0; break;" part shortly
	above only exits one loop level (the for loop, not the
	enclosing while loop), so the while loop itself only
	terminates once "done" is non-zero.
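
	The control flow behind that observation can be modelled in
	isolation.  The sketch below is a simplified stand-in built only
	from the fragments visible in the diff above; NCPU, the
	complete[] array, struct model_result and xcall_wait_model() are
	all made up for illustration and are not the kernel's code.

```c
#include <assert.h>

/* Illustrative model only; names and sizes are assumptions. */
#define NCPU 2

struct model_result {
	int done_after_loop;	/* value of "done" when the while loop exits */
	int newline_printed;	/* would the trailing "if (!done)" fire? */
};

static struct model_result
xcall_wait_model(const int complete[NCPU], int timeout)
{
	struct model_result r;
	int i = timeout, done = 0, n;

	while (!done) {
		--i;		/* i < 0 means the time-out has expired */
		done = 1;
		for (n = 0; n < NCPU; n++) {
			if (complete[n] == 0) {
				if (i < 0) {
					/* timed out: cpu n would be reported here */
				} else {
					done = 0;
					break;	/* leaves the for loop only */
				}
			}
		}
	}

	/* The while condition guarantees done != 0 at this point. */
	r.done_after_loop = done;
	r.newline_printed = !done;
	return r;
}
```

	Whether a CPU never completes or all of them complete at once,
	the model leaves the loop with done == 1, so the trailing
	"if (!done)" test is false in both cases.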


>Fix:
	Sorry, don't know.
	Further hints for debugging gratefully accepted.
>Release-Note:
>Audit-Trail:
>Unformatted: