Subject: Re: The dreaded thread bug [was Re: Stable again?]
To: None <port-sparc64@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: port-sparc64
Date: 10/26/2006 19:58:37
>      It is indeed related to register windows; having to take a fault
>      while context switching to bring in registers from the stack is
>      a no-no.  How to fix it is a different story.

> Thanks for the pointer.  Now, I need to find a good SPARC trap and
> context switching reference, since this will take me beyond my
> current knowledge very shortly.

Well, I'm not that much of a reference, but I do know a bit about the
SPARC.

The issue here is how the SPARC handles traps/faults/interrupts/etc.
Rather than pushing a lot of machine state on some kind of kernel
stack, the way a lot of machines do, the SPARC just does a window save,
but it ignores the window-invalid mask (%wim) when doing so.  This
means that until you've found somewhere to put register windows, you
can't use save and restore at all, and you have only the locals (%l)
available.  (Your ins (%i) are the outs (%o) of the stack frame that
got interrupted; your outs appear available at first sight, but, unless
you make sure to keep at least two windows invalid at all times, they
may actually be the ins of the bottom window.)
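
To make the save-versus-trap difference concrete, here is a toy Python
model of the window rotation described above.  It is only a sketch: the
names (WindowFile, save, trap, is_invalid) are made up for illustration,
NWINDOWS = 8 is an assumed value, and it tracks just the current window
pointer and %wim, not the registers themselves.  Where real hardware
would take a window overflow trap, the model simply raises an exception.

```python
NWINDOWS = 8  # assumed for this sketch; the real count is implementation-dependent


class WindowOverflow(Exception):
    """Raised where real hardware would take a window overflow trap."""


class WindowFile:
    """Hypothetical model of the current window pointer and %wim only."""

    def __init__(self, cwp=0, wim=0):
        self.cwp = cwp  # current window pointer
        self.wim = wim  # window-invalid mask, one bit per window

    def _next(self):
        # Both save and a trap move to window cwp-1, modulo NWINDOWS.
        return (self.cwp - 1) % NWINDOWS

    def is_invalid(self, window):
        return bool((self.wim >> window) & 1)

    def save(self):
        # save honours %wim: it will not enter a window marked invalid.
        new = self._next()
        if self.is_invalid(new):
            raise WindowOverflow(new)
        self.cwp = new

    def trap(self):
        # A trap rotates unconditionally, ignoring %wim.
        self.cwp = self._next()
        return self.cwp


# Usage: window 0 is marked invalid and we are running in window 1.
w = WindowFile(cwp=1, wim=1 << 0)
try:
    w.save()          # refused: the next window is invalid
except WindowOverflow:
    pass
w.trap()              # cwp is now 0, even though window 0 is invalid
```

In the model, as in the description above, the code running after
trap() is sitting in a window that %wim says it shouldn't be using,
which is why only the locals are safe there.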

Window overflow and underflow are traps, and as such they work the same
way: they shift into the next window (the trap window, it's called),
whether it's marked invalid or not, and branch to a location that
depends on the trap in question.  Thus, until a window of registers has
been spilled *somewhere*, you can't take a fault.  (Actually, for window
underflow, which is what's involved if you're "bring[ing] in registers
from the stack", you probably have lots of invalid windows and thus
don't actually need to put any registers anywhere, but I don't think
that's guaranteed - can't there be some user registers and some kernel
registers both in the hardware at once?)

The hardware disables traps, interrupts included, when taking a
fault/trap/interrupt, so that an interrupt arriving at a bad time during
(say) a window overflow trap handler doesn't wreck things.  During this
time, if the hardware tries to take a fault/trap, it halts to the OBP -
this is the dreaded "Watchdog reset": a trap while traps are disabled.
It is somewhat analogous to the double error halt on a VAX.
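
The "trap while traps are disabled" rule can be sketched the same way.
Again this is a toy Python model under stated assumptions, not a
description of any particular chip: TrapState, take_trap, interrupt and
enable_traps are hypothetical names, a single boolean stands in for the
hardware's trap-enable state, and an exception stands in for the halt
to the OBP.  The model also assumes that an interrupt arriving while
traps are disabled is simply held off rather than taken.

```python
class ErrorMode(Exception):
    """Stands in for the halt to the OBP (the "Watchdog reset")."""


class TrapState:
    """Hypothetical model of a single trap-enable flag."""

    def __init__(self):
        self.traps_enabled = True

    def take_trap(self, name):
        # A fault/trap arriving while traps are disabled is fatal.
        if not self.traps_enabled:
            raise ErrorMode(name + " taken while traps are disabled")
        # Taking a trap disables further traps until the handler
        # has found somewhere to put a window and re-enables them.
        self.traps_enabled = False

    def interrupt(self):
        # Assumption in this sketch: an interrupt is held off, not
        # taken, while traps are disabled.  Returns True if taken.
        if not self.traps_enabled:
            return False
        self.take_trap("interrupt")
        return True

    def enable_traps(self):
        self.traps_enabled = True


# Usage: a fault inside a handler that has not yet re-enabled traps.
state = TrapState()
state.take_trap("window_overflow")    # traps are now disabled
state.interrupt()                     # held off, nothing happens
try:
    state.take_trap("data_fault")     # e.g. the stack page isn't there
except ErrorMode:
    pass                              # this is the watchdog reset case
```

The failing case in the usage above is the one from the quoted text:
the handler touches the stack to move registers, that access faults,
and there is no way left to take the fault.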

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B