Subject: Re: Followup: MCHK exception in -current with MMU off
To: Tim Kelly <hockey@dialectronics.com>
From: Nathan J. Williams <nathanw@wasabisystems.com>
List: port-powerpc
Date: 04/11/2005 12:09:44

Tim Kelly <hockey@dialectronics.com> writes:

> At 12:28 AM -0400 4/11/05, Nathan J. Williams wrote:
> >I don't believe this is correct. As I mentioned, I added a simple
> >printf() to vzeropage() that printed out both the pa it's trying to
> >zero and the address of the pa variable - essentially, the location of
> >the stack at that point. Here's some output:
> 
> Did you run any tests with Altivec not defined? We did, and found an additional
> exception.

I'm quite willing to believe that there's more than one bug. Is that
also with -O0? As I said in my first mail, I think that the
non-Altivec implementations of pmap_copy_page() and pmap_zero_page()
have similar issues with accessing local variables when PSL_DR is
off. However, I wasn't able to trigger that bug with pmap_subr.c
compiled -O0 and pmap_use_altivec forced to 0; I didn't otherwise turn
off Altivec.

> >Your analysis is flawed because the set of registers you see in DDB
> >after a panic() reflects their state when panic() is called, not when
> >the fatal trap happens. Stick a printf of frame->fixreg[9] before
> >the panic("trap") in trap.c to see what I mean.
> 
> Calls into panic muck up the registers, such that ddb can't restore them?
> That's real useful for debugging. This needs to be filed as a PR. Geez, even
> MacsBug from eight years ago wasn't this poor.

As Matt said, they're still there on the call stack. The principal use
of the registers in DDB is for looking at the live registers after
hitting a breakpoint and when stepping through code; that's a
situation where DDB is invoked directly by the CPU's exception
mechanics, rather than through trap(). A more useful change here
would be not to alter DDB, but to make the "panic: trap" path dump
more state - after all, DDB might not be there at all, and a register
dump would be useful for post-mortem analysis.
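
As a rough illustration of the kind of dump meant here, the following
is a minimal userland sketch, not the actual trap.c code. It assumes a
cut-down, hypothetical stand-in for the trapframe (struct tf_sketch,
holding only fixreg[], srr0 and srr1) and a hypothetical helper,
tf_format(), that renders the saved registers into a buffer:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the port's trapframe; the real structure
 * has more fields.  Only what the dump needs is modelled here.
 */
struct tf_sketch {
	uint32_t fixreg[32];	/* r0..r31 as saved at exception entry */
	uint32_t srr0;		/* address reported for the exception */
	uint32_t srr1;		/* saved MSR */
};

/*
 * Format the saved registers into buf, four GPRs per line.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small to hold the whole dump.
 */
int
tf_format(const struct tf_sketch *tf, char *buf, size_t len)
{
	size_t off;
	int i, n;

	assert(tf != NULL && buf != NULL);

	n = snprintf(buf, len, "srr0=0x%08" PRIx32 " srr1=0x%08" PRIx32 "\n",
	    tf->srr0, tf->srr1);
	if (n < 0 || (size_t)n >= len)
		return -1;
	off = (size_t)n;

	for (i = 0; i < 32; i++) {
		n = snprintf(buf + off, len - off, "r%02d=0x%08" PRIx32 "%c",
		    i, tf->fixreg[i], (i % 4 == 3) ? '\n' : ' ');
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

In a kernel the equivalent would print straight from the frame pointer
that trap() already has, just before the panic("trap") call, so the
values reflect the moment of the exception rather than whatever
panic() and its callees have left in the live registers.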

> Beyond the few registers that are used for panic's output strings, why
> would r9 get mucked before it could be recorded? That's really poor.

It is recorded, just not where DDB is looking.

> >instruction; it's a best-effort kind of thing. It may be a couple of
> >instructions late, and I think that's happening here.
> 
> Quoting Programming Environment Manual, 32 bit, page 6-8:
> 
> For machine check exceptions, SRR0 holds either an instruction that would have
> completed or some instruction following it that would have completed if the
> exception had not occurred.

See the MPC7450 manual, section 4.6.2, "Machine Check Exception":

  A TEA indication on the bus can result from any load or store
  operation initiated by the processor. In general, TEA is expected to
  be used by a memory controller to indicate that a memory parity
  error or an uncorrectable memory ECC error has occurred. Note that
  the resulting machine check exception is imprecise and unordered
  with respect to the instruction that originated the bus operation.


> However, as you didn't reply to the original post, I get your point that I
> am wasting my time and that I should wait until the cathedral builders
> think there is a problem.

I'm sorry. There was no hidden message in the choice of what I replied
to; it's just easier for me to reproduce the original bug myself and
analyze it in a controlled setting than it is to work through the
description of an analysis on a machine I don't have. Our goal here is
the same: fixing the reported bugs.

        - Nathan