port-sparc64: Re: The dreaded thread bug [was Re: Stable again?]

Subject: Re: The dreaded thread bug [was Re: Stable again?]
To: None <port-sparc64@NetBSD.org>
From: Geoff Adams <gsa-netbsd@alldestroying.com>
List: port-sparc64
Date: 10/27/2006 20:02:25

Thanks a lot for the great pointers.

I'm still working through the details, and I've got more digging to
do before I start asking the really interesting questions, but I do
have a couple:

- In trap.c (in both sparc and sparc64), mention is made of
mem_access_fault(), through which MMU-related traps are said to go
instead of trap(). But this function doesn't seem to exist. Is that a
vestige of a former design? I guess its job is now done by
text_access_fault() and data_access_fault().

And, of course, the big question looming in my mind:

- If this is related to trap handling, why does this happen only when  
executing threaded processes? Surely we take a similar number and  
type of traps during execution of threaded and non-threaded  
processes, and non-threaded processes can run for years, literally.  
Window overflows that occur as a result of a trap (on pre-v9) must
already be handled properly, since they come up fairly routinely in
normal (non-threaded) execution, no?

I'm still reading nathanw's paper on Scheduler Activations in NetBSD,  
but I suspect that the real difference here comes in how we pass  
things up through an upcall. Or do we use more software traps to  
maintain the various bits of thread state, and something's going  
wrong somewhere in there?

I'll agree that the evidence does point to a window restore problem.

Still perusing locore.s and letting things percolate in my mind...

- Geoff