port-sparc: The pv_unlink0 saga continues

Subject: The pv_unlink0 saga continues
To: None <port-sparc@netbsd.org>
From: Erik Bertelsen <erik@mediator.uni-c.dk>
List: port-sparc
Date: 01/17/1999 20:31:01
Hello all,
a few months ago I corresponded with the list, and individually with at least
Erik Fair, about one of our SPARCstations (an SS20) that panicked with
pv_unlink0 from time to time. Nobody reached a definitive conclusion about why
this happened; the consensus was only that the problem was unrelated to a
couple of PRs filed by Erik Fair about pv_unlink0 problems on earlier systems.

I saw a hint from Robert Elz on the list about multiplying sequences of
consecutive nop instructions in the kernel, and I took the liberty of doubling
all sequences of more than one nop in sparc/locore.s (except in jump table
entries). After doing this, we never saw pv_unlink0 panics again, but from
time to time the system died with other panics or (a few times) just became
unresponsive, seemingly unable to start and/or schedule processes. This
went on for a few months, with the machine running NetBSD 1.3.2 (with the
INRIA IPv6 code as the only other change).

Recently I upgraded to 1.3.3 when the INRIA code became available. At that
time I started with fresh sources, i.e. without my nop duplications.

At once we started seeing pv_unlink0 panics again during medium to
heavy system activity, but only after the system had been up for some time.
In some of the experiments mentioned below, it typically took between one and
two runs of make clean && make in /usr/src/usr.sbin to make it crash.

I made a printout of locore.s and identified all places in this file with more
than one nop on the same line.
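
For anyone who wants to repeat this search without a printout, here is a
minimal sketch in Python. It is not what I did (I worked from paper); the
function name find_multi_nop_lines and the regular expression are made up
for illustration, and a "nop" inside a comment would be counted too.

```python
import re

# Match "nop" only as a whole word, so identifiers that merely contain
# the letters "nop" (e.g. "_nop" or "nopx") are not counted.
NOP = re.compile(r"\bnop\b")

def find_multi_nop_lines(text, minimum=2):
    """Return (line_number, line) pairs holding at least `minimum` nops.

    Caveat: this is a plain text match, so the word "nop" inside a
    comment is counted like an instruction.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(NOP.findall(line)) >= minimum:
            hits.append((number, line.rstrip()))
    return hits
```

To use it, read locore.s into a string, pass it to find_multi_nop_lines(),
and print the returned line numbers as candidates for doubling.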

I doubled several nop sequences, still got the pv_unlink0 panics, and reverted
those doublings.

Finally I doubled the line with 3 nops in CHECK_SP_REDZONE, and the way the
machine died changed (a different panic, but I don't have the details
here). Then I made the more radical change shown here:

diff -c -r1.1.1.1 locore.s
*** locore.s    1999/01/03 11:51:12     1.1.1.1
--- locore.s    1999/01/17 19:07:15
***************
*** 1197,1204 ****
--- 1197,1208 ----
        rd      %psr, t1;               /* t1 = splhigh() */ \
        or      t1, PSR_PIL, t2; \
        wr      t2, 0, %psr; \
+       nop; nop; nop; /* SS 20 panic? */ \
+       nop; nop; nop; /* SS 20 panic? */ \
        wr      t2, PSR_ET, %psr;       /* turn on traps */ \
        nop; nop; nop; \
+       nop; nop; nop; /* SS 20 panic? */ \
+       nop; nop; nop; /* SS 20 panic? */ \
        save    %sp, -CCFSZ, %sp;       /* preserve current window */ \
        sethi   %hi(Lpanic_red), %o0; \
        call    _panic; or %o0, %lo(Lpanic_red), %o0; \

With this patch, the machine has been up for more than 6 days, and it has
literally done dozens of make clean && make runs in /usr/src/usr.sbin, where
one or two of these used to be enough to kill the machine.

I will probably try to reduce the number of nops until the machine becomes
unstable again, but for the time being I just want to report the current
status of this problem and to solicit feedback about why it happens.
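
One possible way to organise that reduction is a binary search over the
number of added nops. The sketch below is only an illustration in Python:
is_stable is a hypothetical callback standing for "build a kernel with n
extra nops, run the make load for long enough, and report whether it
survived". It also assumes that stability never gets worse as nops are
added, which nothing so far proves.

```python
def smallest_stable_count(is_stable, known_unstable, known_stable):
    """Binary search for the smallest nop count that still looks stable.

    is_stable(n) is a hypothetical callback: True if a kernel built with
    n extra nops survived the stress load, False if it crashed.
    Assumes is_stable(known_unstable) is False, is_stable(known_stable)
    is True, and that adding nops never makes things worse.
    """
    low, high = known_unstable, known_stable
    while high - low > 1:
        middle = (low + high) // 2
        if is_stable(middle):
            high = middle   # survived: try fewer nops next
        else:
            low = middle    # crashed: more nops are needed
    return high
```

With 0 extra nops known to crash and the 12 extra nops from the patch above
apparently stable, this would need about four test kernels. Since the panics
are intermittent, each trial has to run long enough that surviving means
something.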

confused regards
- Erik Bertelsen, UNI-C