Subject: Re: 1.6 woes (pmap vs. UBC?)
To: Jukka Marin <jmarin@pyy.jmp.fi>
From: Greg A. Woods <woods@weird.com>
List: port-sparc
Date: 08/05/2002 18:59:58
[ On Monday, July 29, 2002 at 08:20:02 (+0300), Jukka Marin wrote: ]
> Subject: Re: 1.6 woes
>
> On Sun, Jul 28, 2002 at 07:08:59PM -0400, Greg A. Woods wrote:
> > I've run for two weeks without a crash.  Other times it won't stay
> > running long enough for me to login.  Only very rarely will some other
> > program die, such as snmpd, though I can't be sure those other programs
> > are not just buggy (I've only done more extensive debugging with XsunMono).
> > 
> > I'm willing to try the patch, but if anyone's seen even one error since
> > applying it then I'd have to suggest that it's not really the solution.
> 
> Well, after several hours, another Sparc managed to build that pine source
> (it didn't crash even once, it just took 5 hours or more), so maybe the
> signal 10's on the other machine were caused by some hardware problem.
> Manuel's patch enabled me to build things and other than those signals on
> one machine, everything has been stable with the patch.

Today I managed to get around to compiling a kernel with Manuel's patch.

My formerly speedy SS-1+ is now crawling like a snail, even just running
as a diskless workstation!  :-)

I killed the local instance of swisswatch, which was ticking every
second, and am now running it on a remote machine.  That helped a lot,
presumably because there's now one less context switch happening every
second.  I'm almost thinking of doing the same with my window manager
too!  (Though I'll run swisswatch locally again while I'm not sitting in
front of the machine -- the Xserver quite often crashes even with no
keyboard/mouse input occurring.)

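I guess that's to be expected, given what the patch does: on sun4c it
replaces the single cache_flush_segment() call with 64 separate
cache_flush_page() calls.  It seems every context switch now really
costs big-time.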
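
As a rough illustration of where that cost comes from, here is a minimal
sketch of the arithmetic behind the loop in the attached diff.  It
assumes the usual sun4c geometry of 4 KB pages and 256 KB segments; the
SKETCH_* constants and sketch_* functions are hypothetical stand-ins
written for this example, not the kernel's own NBPG or
cache_flush_page().

```c
#include <assert.h>

/* Assumed sun4c geometry; these are NOT the kernel's own definitions. */
#define SKETCH_NBPG	4096u		/* bytes per page (4 KB) */
#define SKETCH_NBPSG	(1u << 18)	/* bytes per segment (256 KB) */

/* Counts how many per-page flushes the stand-in below has performed. */
static unsigned int sketch_page_flushes;

/* Hypothetical stand-in for cache_flush_page(): it only counts calls. */
static void
sketch_cache_flush_page(unsigned long va)
{
	(void)va;
	sketch_page_flushes++;
}

/* Number of pages covered by one segment under the assumed geometry. */
static unsigned int
sketch_pages_per_segment(void)
{
	assert(SKETCH_NBPSG % SKETCH_NBPG == 0);
	return SKETCH_NBPSG / SKETCH_NBPG;
}

/*
 * Flush one segment a page at a time, as the patched code path does,
 * and return how many per-page flush calls that took.
 */
static unsigned int
sketch_flush_segment_by_page(unsigned long va)
{
	unsigned int i, n = sketch_pages_per_segment();

	sketch_page_flushes = 0;
	for (i = 0; i < n; i++)
		sketch_cache_flush_page(va + (unsigned long)i * SKETCH_NBPG);
	return sketch_page_flushes;
}
```

Under these assumptions sketch_flush_segment_by_page() makes 64 calls
for any segment base address, which matches the hard-coded loop bound in
the patch: 64 flush calls where there used to be one.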

I've modified the patch (my version is attached) so that I can hopefully
continue to share my source tree with sun4m builds, though obviously
this is not an acceptable solution, if indeed it even is a complete fix.
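
The way the modified patch keeps one source tree usable for both kinds
of kernel can be sketched in isolation.  This is only an illustration of
the guard pattern used in the attached diff: SKETCH_SUN4C and
sketch_cpu_is_sun4c are hypothetical stand-ins for the kernel's SUN4C
option and CPU_ISSUN4C test, and the enum exists only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the SUN4C kernel option. */
#define SKETCH_SUN4C 1

/* Hypothetical stand-in for the run-time CPU_ISSUN4C test. */
static bool sketch_cpu_is_sun4c = true;

enum sketch_flush {
	SKETCH_FLUSH_PER_PAGE,	/* slow per-page workaround path */
	SKETCH_FLUSH_SEGMENT	/* original single segment flush */
};

/*
 * Pick the flush strategy.  When SKETCH_SUN4C is not defined the
 * preprocessor removes the "if ... else", leaving only the braced block
 * with the original behaviour, so non-sun4c kernels compile the same
 * code as before.
 */
static enum sketch_flush
sketch_choose_flush(void)
{
#if defined(SKETCH_SUN4C)
	if (sketch_cpu_is_sun4c) {
		return SKETCH_FLUSH_PER_PAGE;
	} else
#endif
	{
		return SKETCH_FLUSH_SEGMENT;
	}
}

/* Small usage check exercising both run-time outcomes. */
static void
sketch_selftest(void)
{
	sketch_cpu_is_sun4c = true;
	assert(sketch_choose_flush() == SKETCH_FLUSH_PER_PAGE);
	sketch_cpu_is_sun4c = false;
	assert(sketch_choose_flush() == SKETCH_FLUSH_SEGMENT);
}
```

The compile-time guard keeps the workaround out of kernels built without
sun4c support, and the run-time test keeps it off the other CPU types in
a kernel built for several.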

I'll really push my use of this machine and see if anything crashes over
the next few days (it's only been running three hours so far, and as
I've noted before, it's been known to run fine for weeks under light use).

What's interesting about this change is that, if it does indeed fix the
problem, it suggests the failure does not happen every time the
offending code runs, but is still highly timing-dependent.  I wonder how
often the offending block of code is reached relative to how often
me_alloc() is called.  From the profiling results it looks close to 1:1,
at least for each process context switch.

Also interesting is that if there's some critical timing issue with
cache_flush_segment() then what about the other invocations?  Are they
similarly vulnerable too?  What could this problem be?

Are other sun4c users really avoiding newer NetBSD?  Is that why so few
people have encountered this problem?

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


Index: pmap.c
===================================================================
RCS file: /cvs/NetBSD/src/sys/arch/sparc/sparc/pmap.c,v
retrieving revision 1.1.1.8
diff -c -c -r1.1.1.8 pmap.c
*** pmap.c	12 Jun 2001 22:18:11 -0000	1.1.1.8
--- pmap.c	5 Aug 2002 18:29:14 -0000
***************
*** 1313,1320 ****
  	ctx = getcontext4();
  	if (CTX_USABLE(pm,rp)) {
  		CHANGE_CONTEXTS(ctx, pm->pm_ctxnum);
! 		cache_flush_segment(me->me_vreg, me->me_vseg);
! 		va = VSTOVA(me->me_vreg,me->me_vseg);
  	} else {
  		CHANGE_CONTEXTS(ctx, 0);
  		if (HASSUN4_MMU3L)
--- 1313,1344 ----
  	ctx = getcontext4();
  	if (CTX_USABLE(pm,rp)) {
  		CHANGE_CONTEXTS(ctx, pm->pm_ctxnum);
! #if defined(SUN4C)
! 		/*
! 		 * see http://mail-index.netbsd.org/port-sparc/2002/07/27/0001.html
! 		 */
! 		if (CPU_ISSUN4C) {
! 			va = VSTOVA(me->me_vreg,me->me_vseg);
! 			/*
! 			 * WARNING:  This is _REALLY_ slow....
! 			 *
! 			 * Maybe it should also do this test:
! 			 *
! 			 *	(* if cacheable, flush page as needed *)
! 			 *
! 			 *	if ((getpte4(va) & PG_NC) == 0)
! 			 *		cache_flush_page(va + i*NBPG);
! 			 *
! 			 * as is done in some other places?
! 			 */
! 			for (i=0; i < 64; i++)
! 				cache_flush_page(va + i*NBPG);
! 		} else
! #endif
! 		{
! 			cache_flush_segment(me->me_vreg, me->me_vseg);
! 			va = VSTOVA(me->me_vreg,me->me_vseg);
! 		}
  	} else {
  		CHANGE_CONTEXTS(ctx, 0);
  		if (HASSUN4_MMU3L)