Subject: Fixing the R4000 end of page bug
To: None <port-mips@netbsd.org>
From: Stephen M. Rumble <stephen.rumble@utoronto.ca>
List: port-mips
Date: 07/27/2006 07:49:53
Hello all,

I've begun looking into fixing some long-standing problems with revision 2.2
R4000 CPUs on SGI systems and would like any input whatsoever from those more
experienced with MIPS. The most prominent bug in this revision occurs when a
branch or jump exists as the last instruction in a page, the following page
(containing the delay slot instruction) is not mapped, and a few other
conditions (including a data cache miss) are met by the two prior instructions.
The errata sheet isn't terribly clear about what all of the conditions are, but
checking for a jump or branch in the last slot of a page is enough to find a
potentially vulnerable instruction sequence. I would start there and try to
make the test more specific afterwards, so as not to work around unproblematic
pages unnecessarily.
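
A minimal sketch of the "jump or branch in the last slot" test, assuming 4 KB
pages and a page image already available as an array of 32-bit words. The
names insn_is_branch_or_jump(), page_ends_in_branch() and EOP_PAGE_SIZE are
made up for illustration and are not existing NetBSD interfaces. The decoding
follows the standard MIPS III opcode fields and is deliberately conservative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EOP_PAGE_SIZE 4096u  /* assumed page size, illustration only */

/* Return true if the 32-bit MIPS instruction word has a delay slot. */
static bool insn_is_branch_or_jump(uint32_t insn)
{
    uint32_t op    = insn >> 26;           /* major opcode */
    uint32_t rs    = (insn >> 21) & 0x1f;
    uint32_t rt    = (insn >> 16) & 0x1f;
    uint32_t funct = insn & 0x3f;

    switch (op) {
    case 0x00:  /* SPECIAL: JR (0x08), JALR (0x09) */
        return funct == 0x08 || funct == 0x09;
    case 0x01:  /* REGIMM: BLTZ/BGEZ and their likely/link forms use
                   rt 0..3 and 16..19; rt 8..14 are traps, not branches */
        return (rt & 0x0c) == 0;
    case 0x02: case 0x03:                         /* J, JAL */
    case 0x04: case 0x05: case 0x06: case 0x07:   /* BEQ, BNE, BLEZ, BGTZ */
    case 0x14: case 0x15: case 0x16: case 0x17:   /* BEQL .. BGTZL */
        return true;
    case 0x10: case 0x11: case 0x12: case 0x13:   /* COPz: rs == 8 is BCzF/T */
        return rs == 0x08;
    default:
        return false;
    }
}

/* Check only the last instruction slot of a page image. */
static bool page_ends_in_branch(const uint32_t *page)
{
    return insn_is_branch_or_jump(page[EOP_PAGE_SIZE / sizeof(uint32_t) - 1]);
}
```

This flags every page whose last word decodes as a branch or jump, including
data pages that merely look that way, so it over-approximates the errata
conditions in the same way as the starting point described above.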

The seemingly obvious fix is to guarantee that the following page is always
mapped in the TLB when the troublesome page also exists there. In looking at the
code, this appears fairly intrusive and also rather complicated. We'd need to go
to pains to swap in the next page when servicing a lookup on a bad page, remove
wired mappings when switching contexts, remove the wired mapping when replacing
a problematic mapped page, deal with consecutive pages that may be problematic,
and so on. Even then, a program could in theory have N consecutive bad
pages and require N wired entries in the TLB. The best solution is to ensure
that a jump never lands on an end-of-page boundary, so that no kernel
workaround is needed at all. The GNU toolchain doesn't appear to support this,
though workarounds for other, less serious errata are now in gcc4.

In trying to consider an alternative solution, I thought of dynamically altering
the troublesome sequence when it is placed into the pmap via pmap_enter() under
the assumption that changing the sequence would avoid the problem entirely. The
instruction immediately preceding the branch or jump could be changed into a
syscall instruction. When executed, the syscall handler would hook very early
into the fixup code, replace the syscall with the real instruction, modify the
jump/branch instruction to be another syscall, and then resume execution. The
next syscall would then set the second-to-last instruction back to a syscall,
restore the original jump/branch, and resume execution.

For example, the errata sheet gives the following as a bad sequence (leftmost
column); the other columns show the page contents after each step:

  original:      pmap_enter():     syscallA:        syscallB (same as pmap_enter):
   lw      -->    lw       -->      lw       -->     lw
   div     -->    syscallA -->      div      -->     syscallA
   beq     -->    beq      -->      syscallB -->     beq
  -------- PAGE BOUNDARY --------

I think that the position of syscallA is guaranteed never to be a jump or branch
(if it were, the problem wouldn't present itself), thus maintaining the state
transitions.
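
The transitions above can be sketched as a tiny state machine over the last
two words of a page. This is only an illustration under stated assumptions:
struct eop_fixup, the eop_fixup_*() functions and the two syscall code values
are hypothetical, and the sketch patches a plain array rather than a mapped
page. On real hardware the patched words would presumably also need a data
cache writeback and an instruction cache invalidate, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* MIPS SYSCALL: opcode SPECIAL (0), funct 0x0c, 20-bit code in bits 6..25. */
#define MIPS_SYSCALL(code) (((((uint32_t)(code)) & 0xfffffu) << 6) | 0x0cu)

/* Hypothetical code values reserved for the fixup (assumed, not real). */
#define EOP_SYSCALL_A MIPS_SYSCALL(0xeeee0)
#define EOP_SYSCALL_B MIPS_SYSCALL(0xeeee1)

struct eop_fixup {
    uint32_t *prev;       /* second-to-last instruction slot in the page */
    uint32_t *last;       /* last slot: the branch or jump */
    uint32_t  saved_prev; /* original instruction from the prev slot */
    uint32_t  saved_last; /* original branch or jump */
};

/* pmap_enter() time: remember both instructions, plant syscallA. */
static void eop_fixup_install(struct eop_fixup *f, uint32_t *words, size_t n)
{
    f->prev = &words[n - 2];
    f->last = &words[n - 1];
    f->saved_prev = *f->prev;
    f->saved_last = *f->last;
    *f->prev = EOP_SYSCALL_A;
}

/* syscallA: restore the real instruction, replace the branch by syscallB. */
static void eop_fixup_syscall_a(struct eop_fixup *f)
{
    *f->prev = f->saved_prev;
    *f->last = EOP_SYSCALL_B;
}

/* syscallB: restore the branch and re-arm syscallA (pmap_enter() state). */
static void eop_fixup_syscall_b(struct eop_fixup *f)
{
    *f->last = f->saved_last;
    *f->prev = EOP_SYSCALL_A;
}
```

In each handler the faulting instruction would be re-executed after the
patch, so the exception return address stays at the slot that trapped.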

In doing this, we need only add a hook to pmap_enter, a hook in syscall_plain
and syscall_fancy, a call somewhere in mips_machdep.c to determine if the bug is
present, and a call in pmap_destroy to clean up. We'd also need to maintain
lists of replaced instructions (hence the need for cleanup), but that could
probably be kept internal to the fixup code. All of these calls would be
conditionally compiled, with the hooks defined as empty macros when not needed.
There'd be virtually no overhead for systems without this bug, though depending
on the jump instruction placement of a given program, slowdown on broken
machines could be fairly significant. The benefit, though, is that our TLB code
is uncomplicated and we needn't wire pages that might rarely be accessed. In the
end, I'd like to look into fixing the toolchain to avoid creating this
problematic code, and thus avoid the slowdown altogether. If a program runs too
slowly, somebody can just recompile it, or maybe run it through a more
intelligent fixup program, similar to IRIX's 'mipscheck'.
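
A rough sketch of how the hooks could compile away on unaffected systems. The
option name R4000_EOP_FIXUP, the R4K_EOP_* macros and the r4k_eop_*()
functions are all invented for this example, and example_pmap_enter() merely
stands in for a real caller such as pmap_enter():

```c
#include <stdint.h>

#ifdef R4000_EOP_FIXUP  /* hypothetical kernel option, not an existing one */
void r4k_eop_fixup_enter(uintptr_t va);  /* hypothetical hook */
void r4k_eop_fixup_destroy(void);        /* hypothetical hook */
#define R4K_EOP_ENTER(va)   r4k_eop_fixup_enter(va)
#define R4K_EOP_DESTROY()   r4k_eop_fixup_destroy()
#else
/* Option off: the hooks expand to nothing and cost nothing at run time. */
#define R4K_EOP_ENTER(va)   ((void)0)
#define R4K_EOP_DESTROY()   ((void)0)
#endif

static int example_enter_calls;  /* only here so the sketch is observable */

/* Stand-in for a caller such as pmap_enter(); not the real function. */
static void example_pmap_enter(uintptr_t va)
{
    R4K_EOP_ENTER(va);      /* no code generated unless the option is set */
    example_enter_calls++;
}
```

With the option undefined the callers need no #ifdef of their own, which
keeps the changes to pmap_enter, pmap_destroy and the syscall paths small.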

Any suggestions, opinions, or problems foreseen in my idea? Would the TLB
solution be simpler than I imagine? Must I pay special attention to caches when
tweaking instructions during the syscall? Perhaps somebody has a better idea
entirely. I'm open to all ideas, and to criticism as well.

Thanks,
Steve