netbsd-bugs: port-sparc/6969: pv_unlink0 panics on a SS20

Subject: port-sparc/6969: pv_unlink0 panics on a SS20
To: None <gnats-bugs@gnats.netbsd.org>
From: None <erik@mediator.uni-c.dk>
List: netbsd-bugs
Date: 02/08/1999 18:31:51
>Number:         6969
>Category:       port-sparc
>Synopsis:       pv_unlink0 panics on a SS20
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    port-sparc-maintainer (NetBSD/sparc Portmaster)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Feb  8 09:35:01 1999
>Last-Modified:
>Originator:     Erik Bertelsen
>Organization:
	UNI-C
>Release:        NetBSD-1.3.3
>Environment:
	
NetBSD femur.ipv6.uni-c.dk 1.3.3 NetBSD 1.3.3 (FEMUR) #5: Mon Feb  1 13:17:47 CET 1999     erik@femur.ipv6.uni-c.dk:/sw/src/sys/arch/sparc/compile/FEMUR sparc



>Description:

A few months ago, I corresponded on the port-sparc list, and privately
with at least Erik Fair, about one of our SPARCstations (an SS20) that
panicked with pv_unlink0 from time to time. Nobody reached a
definitive conclusion about why this happened, except that the problem
was mostly judged to be unrelated to a couple of PRs by Erik Fair
about pv_unlink0 problems on earlier systems (around 1.3, solved in or
before 1.3.2).

I saw a hint from Robert Elz on the list about multiplying sequences
of consecutive nop instructions in the kernel, and I took the liberty
of doubling all sequences of more than one nop in sparc/locore.s
(except in jump table entries). After doing this, we never saw
pv_unlink0 panics again, but from time to time the system died with
other panics or (a few times) just became unresponsive, seemingly
unable to start and/or schedule processes. This went on for a few
months, with the machine running NetBSD 1.3.2 (with the INRIA IPv6
code as the only other change).

After New Year I upgraded to 1.3.3 when the INRIA code became
available. At that time I started with fresh sources, i.e. without my
nop duplications.

We immediately started seeing pv_unlink0 panics again during medium
to heavy system activity, but only after the system had been up for
some time. In some of the experiments described below, it typically
took between one and two runs of make clean && make in
/usr/src/usr.sbin to crash the machine.

I made a printout of locore.s and identified all places in this file
with more than one nop on the same line.

I doubled several nop sequences, still got the pv_unlink0 panics, and
reverted those doublings.

Finally I doubled the line with 3 nops in CHECK_SP_REDZONE, and the
way the machine died changed (a different panic, but I don't have the
details here). Then I made the more radical change shown here:

diff -c -r1.1.1.1 locore.s
*** locore.s    1999/01/03 11:51:12     1.1.1.1
--- locore.s    1999/01/17 19:07:15
***************
*** 1197,1204 ****
--- 1197,1208 ----
        rd      %psr, t1;               /* t1 = splhigh() */ \
        or      t1, PSR_PIL, t2; \
        wr      t2, 0, %psr; \
+       nop; nop; nop; /* SS 20 panic? */ \
+       nop; nop; nop; /* SS 20 panic? */ \
        wr      t2, PSR_ET, %psr;       /* turn on traps */ \
        nop; nop; nop; \
+       nop; nop; nop; /* SS 20 panic? */ \
+       nop; nop; nop; /* SS 20 panic? */ \
        save    %sp, -CCFSZ, %sp;       /* preserve current window */ \
        sethi   %hi(Lpanic_red), %o0; \
        call    _panic; or %o0, %lo(Lpanic_red), %o0; \

With this patch, the machine has been running stably for several
weeks, and it has done literally dozens of make clean && make runs in
/usr/src/usr.sbin, where 1-2 of these used to be enough to kill the
machine. In fact, the whole system has been rebuilt since I
originally wrote these sentences, still with no system failures.

As the patch adds nop instructions to code that never gets executed,
it does not directly reveal what the real problem is, and it may not
even be a correct patch, but it seems to work "just all right" on
this machine.

I have received messages suggesting that the problem may lie in some
other behavior, such as cache line positions. Chris Torek suggested that
it may be related to:

  "The SS20 uses the TI "Viking" CPU chip.  (Some models use a "Voyager"
   instead; I think the following applies only to the Viking.)  This
   chip has a level-1 D-cache that can participate in I/O.  When the
   chip is in write-through mode (as on any dual-processor SS20, or 
   any machine with the MXCC and Ecache) all is okay, but when it is
   write-back mode (as on a single processor box with no Ecache), the
   D-cache must interact with I/O transactions.  Early versions of   
   the Viking have bugs in this hardware.  (They have a bunch of other
   bugs too, but this is the nastiest I know about, as it winds up
   corrupting I/O transactions and possibly putting bad data into the
   cache.)"
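
On the cache line position idea mentioned above, here is a rough
sketch of the arithmetic only; it is not a diagnosis. It assumes the
64-byte instruction cache line size reported in the dmesg output
below and the fixed 4-byte SPARC instruction size, and it counts the
12 nops the patch adds to each expansion of CHECK_SP_REDZONE. The
function names and the example address are hypothetical and chosen
purely for illustration.

```python
# Sketch only: how far code following CHECK_SP_REDZONE moves when the
# patch is applied, and where it lands relative to I-cache lines.

INSN_BYTES = 4          # every SPARC V8 instruction is 4 bytes long
ICACHE_LINE_BYTES = 64  # "64 b/l" instruction cache lines, per dmesg
ADDED_NOPS = 12         # four added lines of "nop; nop; nop;"


def line_and_offset(addr, line_bytes=ICACHE_LINE_BYTES):
    """Return (cache line number, byte offset within the line)."""
    return addr // line_bytes, addr % line_bytes


def shifted(addr, added_insns=ADDED_NOPS):
    """Address of the same instruction after the nops are inserted."""
    return addr + added_insns * INSN_BYTES


if __name__ == "__main__":
    # Hypothetical address of an instruction after the macro expansion.
    example_addr = 0xF8005000

    print("shift in bytes:", ADDED_NOPS * INSN_BYTES)
    print("before patch:", line_and_offset(example_addr))
    print("after patch: ", line_and_offset(shifted(example_addr)))
```

Under these assumptions each expansion grows by 48 bytes, which is
three quarters of an instruction cache line, so code following it
changes its alignment within a line. Whether that has anything to do
with the panics is not established here.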

I enclose dmesg output for reference:


NetBSD 1.3.3 (FEMUR) #5: Mon Feb  1 13:17:47 CET 1999
    erik@femur.ipv6.uni-c.dk:/sw/src/sys/arch/sparc/compile/FEMUR
real mem = 66256896
avail mem = 59822080
using 808 buffers containing 3309568 bytes of memory
bootpath: /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@3,0
mainbus0 (root): SUNW,SPARCstation-20
cpu0 at mainbus0: mid 8: TMS390Z50 v0 or TMS390Z55 @ 60 MHz, on-chip FPU
cpu0: physical 20K instruction (64 b/l), 16K data (32 b/l), 1024K external (32 b/l): cache enabled
cpu at mainbus0 not configured
obio0 at mainbus0
clock0 at obio0 addr 0xf1200000: mk48t08 (eeprom)
timer0 at obio0 addr 0xf1300000 delay constant 28
[zs at obio0] addr 0xf1100000 not configured
zs1 at obio0 addr 0xf1000000 pri 12, softpri 6
kbd0 at zs1 channel 0 (console)
ms0 at zs1 channel 1
[SUNW,fdtwo at obio0] addr 0xf1700000 not configured
auxreg0 at obio0 addr 0xf1800000
power0 at obio0 addr 0xf1a01000
cgfourteen0 at obio0 addr 0x9c000000: cgthree emulated at 1152x900x8bpp (console)
cgfourteen0: attached to /dev/fb
[cgfourteen at obio0] addr 0x90000000 not configured
iommu0 at mainbus0 ioaddr 0xe0000000: version 0x1/0x1, page-size 4096, range 64MB
sbus0 at iommu0: clock = 25 MHz
dma0 at sbus0 slot 15 offset 0x400000: rev 2
esp0 at dma0 slot 0xf offset 0x800000 pri 4: ESP200, 40MHz, SCSI ID 7
scsibus0 at esp0: 8 targets
probe(esp0:1:0): max sync rate 10.00Mb/s
sd1 at scsibus0 targ 1 lun 0: <SEAGATE, ST31200W SUN1.05, 8724> SCSI2 0/direct fixed
sd1: 1006MB, 2700 cyl, 9 head, 84 sec, 512 bytes/sect x 2061108 sectors
probe(esp0:3:0): max sync rate 10.00Mb/s
sd0 at scsibus0 targ 3 lun 0: <CONNER, CP30548  SUN0535, B0CD> SCSI2 0/direct fixed
sd0: 517MB, 2242 cyl, 6 head, 78 sec, 512 bytes/sect x 1059528 sectors
probe(esp0:6:0): max sync rate 4.23Mb/s
cd0 at scsibus0 targ 6 lun 0: <TOSHIBA, XM-4101TASUNSLCD, 1084> SCSI2 5/cdrom removable
ledma0 at sbus0 slot 15 offset 0x400010: rev 2
le0 at ledma0 slot 0xf offset 0xc00000 pri 6: address 08:00:20:1c:9d:58
le0: 8 receive buffers, 2 transmit buffers
SUNW,bpp at sbus0 slot 15 offset 0x4800000 not configured
SUNW,DBRIe at sbus0 slot 14 offset 0x10000 not configured
SUNW,rtvc at sbus0 slot 0 offset 0x0 not configured
FORE,sba-200 at sbus0 slot 1 offset 0x800000 not configured
root on sd0a dumps on sd0b
mountroot: trying ffs...
root file system type: ffs
init: copying out path `/sbin/init' 11

>How-To-Repeat:
	
>Fix:
	Really none, but see above...

regards
Erik Bertelsen
>Audit-Trail:
>Unformatted: