port-powerpc: Re: Followup: MCHK exception in -current with MMU off

Subject: Re: Followup: MCHK exception in -current with MMU off
To: =?iso-8859-1?q?Timo_Sch=F6ler?= <timo.schoeler@macfinity.net>
From: Nathan J. Williams <nathanw@wasabisystems.com>
List: port-powerpc
Date: 04/10/2005 17:52:23
"Nathan J. Williams" <nathanw@wasabisystems.com> writes:

> Timo Schoeler <wanker4freedom@web.de> writes:
> 
> > if i can do anything to help fixing this, please mail me.
> 
> It's kind of a pain, but can you use a binary-search technique to
> determine which .o files are causing the problem? That is, build half
> of them with -O3 and half with -O0, and if it works, move half of the
> -O3 to -O0, and so on.
> 
> (My money is on some inline asm or frame-handling code being treated
> differently, but it's probably better to figure out which file is
> problematic than to go looking at all the inline asm...)

I looked into this a bit, and did some binary searching; the culprit
is altivec.o (from arch/powerpc/powerpc/altivec.c), specifically,
the function vzeropage(). Sure enough, it's more than half inline
asm.
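
For anyone who wants to repeat the search, the bisection can be
sketched roughly as below. This is only an illustration, not what I
actually ran: find_culprit(), crashes_fn and simulated_crash() are
made-up names, the "build and boot" step is replaced by a stand-in
callback, and it assumes exactly one object file is at fault and that
the all--O3 kernel works while the all--O0 kernel crashes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical probe: build objects [0, n_unopt) with -O0 and the rest
 * with -O3, boot the result, and return nonzero if the kernel crashed.
 * In real life this is a manual rebuild and reboot.
 */
typedef int (*crashes_fn)(size_t n_unopt, void *ctx);

/*
 * Return the index of the single object file that breaks at -O0.
 * Invariant: a -O0 prefix of length lo works, one of length hi crashes.
 */
size_t
find_culprit(size_t n, crashes_fn crashes, void *ctx)
{
	size_t lo = 0;	/* nothing at -O0: assumed to work */
	size_t hi = n;	/* everything at -O0: assumed to crash */

	assert(n > 0);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (crashes(mid, ctx))
			hi = mid;	/* culprit is among [lo, mid) */
		else
			lo = mid;	/* culprit is among [mid, hi) */
	}
	return hi - 1;
}

/*
 * Stand-in for the real build-and-boot probe, for trying the search
 * out: ctx points at the index of the pretend culprit, and the
 * "kernel" crashes whenever that file is in the -O0 set.
 */
int
simulated_crash(size_t n_unopt, void *ctx)
{
	size_t bad = *(size_t *)ctx;

	return bad < n_unopt;
}
```

Each probe is one rebuild and boot, so the number of kernels to build
grows roughly as log2 of the number of object files rather than
linearly.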
 
A backtrace (over the serial line) also points at the same place; it
looks like this:

trap: pid 4.1 (atabus1): kernel MCHK trap @ 0x2af930 (SRR1=0x2041020)
panic: trap
Stopped in pid 4.1 (atabus1) at netbsd:cpu_Debugger+0x10:       lwz     r0, r1, 0x14
db> t
0xd521a7f0: at panic+0x19c
0xd521a880: at trap+0xfc
0xd521a900: kernel MCHK trap by vzeropage+0x88: srr1=0x2041020
            r1=0xd521a9c0 cr=0x40424088 xer=0 ctr=0x2927dc
0xd521a9c0: at uvm_unlock_fpageq+0xc
0xd521aa10: at ADBDevTable+0xd2c170
0xd521aa20: at uvm_pagealloc_strat+0x328
0xd521aa80: at uao_get+0x180
0xd521aad0: at uvm_fault+0x438
0xd521ac00: at uvm_fault_wire+0x90
0xd521ac30: at uvm_lwp_fork+0x64
0xd521ac60: at newlwp+0x134
0xd521aca0: at fork1+0x394
0xd521ad00: at kthread_create1+0x58
0xd521adb0: at scsipi_create_completion_thread+0x38
0xd521add0: at kthread_create+0x44
0xd521adf0: at scsipi_channel_init+0x5c
0xd521ae10: at atapibusattach+0x50
0xd521ae30: at config_attach_loc+0x2d4
0xd521ae90: at config_found_sm_loc+0x64
0xd521aeb0: at wdc_atapibus_attach+0xd8
0xd521aed0: at atabusconfig+0x320
0xd521af20: at atabus_thread+0x78
0xd521af40: at cpu_switchto+0x44
saved LR(0x73622d3a) is invalid.
db> 

The problem is that the inner loop of vzeropage() runs with the data
MMU turned off (PSL_DR cleared), so it can't safely load any C
variables from memory. It uses two C variables, pa and ea. When the
code is optimized, GCC keeps both of them in registers, and everything
works. At -O0, however, GCC always reloads variables from their slots
on the stack, and that load goes wrong once the stack address is only
valid with translation on. I think it's getting an invalid value from
the load and then dying when it tries to run the AltiVec store "stvx"
to that address. You can see this by making vzeropage() print out &pa
before it turns off PSL_DR; the failure happens the first time that
vzeropage() is called from the mapped kernel stack of a kernel task,
rather than the 1:1 kernel stack of the initial task.

The solution is to rewrite the inner loop entirely in assembly, to
ensure that the compiler doesn't try anything inappropriate while the
DMMU is off. (It looks like pmap_subr.c should have the same problem in
the -O0 case for the non-altivec pmap_zero_page and pmap_copy_page
code, but empirically, it doesn't seem to be an issue. I'm not sure
why not.)

        - Nathan