port-powerpc: Re: Followup: MCHK exception in -current with MMU off

Subject: Re: Followup: MCHK exception in -current with MMU off
To: Tim Kelly <hockey@dialectronics.com>
From: Nathan J. Williams <nathanw@wasabisystems.com>
List: port-powerpc
Date: 04/11/2005 00:28:52

Tim Kelly <hockey@dialectronics.com> writes:

> At 5:52 PM -0400 4/10/05, Nathan J. Williams wrote:
> > (It looks like pmap_subr.c should have the same problem in
> >the -O0 case for the non-altivec pmap_zero_page and pmap_copy_page
> >code, but empirically, it doesn't seem to be an issue. I'm not sure
> >why not).
> 
> As I'd identified in my original post
> (http://mail-index.netbsd.org/port-powerpc/2005/03/24/0000.html)
> it is a problem in pmap - pmap_syncicache.

I don't believe this is correct. As I mentioned, I added a simple
printf() to vzeropage() that printed out both the pa it's trying to
zero and the address of the pa variable - essentially, the location of
the stack at that point. Here's some output:

wsmux1: connecting to wsdisplay0
vzeropage: pa 00dec000, &pa 0x410dc8
vzeropage: pa 00deb000, &pa 0x410dc8
vzeropage: pa 00dea000, &pa 0x410dc8
[ many more deleted ] 
vzeropage: pa 00db2000, &pa 0x410aa8
vzeropage: pa 00db1000, &pa 0x410aa8
scsibus0: waiting 2 seconds for devices to settle...

...

wd0 at atabus0 drive 0: <IBM-DPTA-372730>
vzeropage: pa 00daa000, &pa 0xd521a9c8
trap: pid 4.1 (atabus1): kernel MCHK trap @ 0x2af94c (SRR1=0x2041020)
panic: trap
Stopped in pid 4.1 (atabus1) at netbsd:cpu_Debugger+0x10:       lwz r0, r1, 0

db>  t
0xd521a7f0: at panic+0x19c
0xd521a880: at trap+0xfc
0xd521a900: kernel MCHK trap by vzeropage+0xa4: srr1=0x2041020
            r1=0xd521a9c0 cr=0x20424088 xer=0 ctr=0x2c0af8
0xd521a9c0: at vzeropage+0x48
...

With the DMMU off, trying to load from the address 0xd521a9c8 doesn't
work; the only addresses that work here are physical addresses, and
the 0xd521a000 page is well up in kernel virtual address space
(physical memory is at the bottom of the address space). This is
further confirmed by the fact that the TEA bit is set in SRR1 (the
00040000 bit); this is an external error from the memory controller
saying "hey, there's no memory there!".
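
The two checks above can be written out as a minimal sketch in plain
userland C. The function and macro names are made up for illustration
and are not kernel identifiers. The TEA mask is the 00040000 bit quoted
above. The memory size is a parameter because the amount of RAM in this
machine is not stated here, and the sketch assumes RAM is one
contiguous range starting at physical address 0.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mask for the TEA indication in SRR1, as quoted in the message
 * ("the 00040000 bit"). The macro name is hypothetical.
 */
#define SKETCH_SRR1_TEA 0x00040000u

/* True if the saved SRR1 value has the TEA bit set. */
bool
sketch_srr1_has_tea(uint32_t srr1)
{
	return (srr1 & SKETCH_SRR1_TEA) != 0;
}

/*
 * With data translation off, the effective address is used as a
 * physical address. Assumption: RAM is contiguous from 0 up to
 * physmem_bytes, so anything at or above that has no memory behind it.
 */
bool
sketch_real_mode_addr_backed(uint32_t addr, uint32_t physmem_bytes)
{
	return addr < physmem_bytes;
}
```

Applied to the values in the log, sketch_srr1_has_tea(0x2041020) is
true. For any memory size that fits below 0xd0000000, an &pa of
0x410dc8 counts as backed once RAM extends past it, while 0xd521a9c8
never does.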

Your analysis is flawed because the set of registers you see in DDB
after a panic() reflects their state when panic() is called, not when
the fatal trap happens. Stick a printf of frame->fixreg[9] before
the panic("trap") in trap.c to see what I mean.
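
A rough sketch of that kind of debugging print, written as standalone
C so it can be tried outside the kernel. The struct is a simplified
stand-in for the real trap frame and carries only the fields used
here; the function name and the output format are invented for this
example. In trap.c the equivalent would be a single printf placed just
before the panic("trap") call.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Simplified stand-in for the kernel trap frame (assumption: only
 * the fields this sketch prints are included).
 */
struct sketch_trapframe {
	uint32_t fixreg[32];	/* saved r0..r31 at the time of the trap */
	uint32_t srr0;		/* saved PC */
	uint32_t srr1;		/* saved MSR plus exception status bits */
};

/*
 * Format the trap-time register values from the saved frame. These
 * are the values at the trap, unlike the live registers DDB shows
 * after panic() has already run.
 */
int
sketch_format_trap_regs(char *buf, size_t len,
    const struct sketch_trapframe *frame)
{
	return snprintf(buf, len,
	    "trap: srr0=0x%08" PRIx32 " srr1=0x%08" PRIx32
	    " r9=0x%08" PRIx32,
	    frame->srr0, frame->srr1, frame->fixreg[9]);
}
```

The point of going through the saved frame is that it is captured on
exception entry, so it is not disturbed by whatever trap() and panic()
do to the registers afterwards.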

Also, the machine check isn't guaranteed to be on exactly the faulting
instruction; it's a best-effort kind of thing. It may be a couple of
instructions late, and I think that's happening here.

        - Nathan