port-powerpc: Re: Followup: MCHK exception in -current with MMU off

Subject: Re: Followup: MCHK exception in -current with MMU off
To: Nathan J. Williams <nathanw@wasabisystems.com>
From: Tim Kelly <hockey@dialectronics.com>
List: port-powerpc
Date: 04/11/2005 13:00:20
At 12:09 PM -0400 4/11/05, Nathan J. Williams wrote:

>I'm quite willing to believe that there's more than one bug. Is that
>also with -O0? As I said in my first mail, I think that the
>non-Altivec implementation of pmap_copy_page() and pmap_zero_page()
>have similar issues with accessing local variables when PSL_DR is
>off. However, I wasn't able to trigger that bug with pmap_subr.c
>compiled -O0 and pmap_use_altivec forced to 0; I didn't otherwise turn
>off Altivec.

Yes, the pmap bug was with optimizations off and Altivec manually turned
off. It tracked into the same situation that you explained: the code is
inlined, but it refers to a C variable (pa). My main point is that the
approach you recommended appears to be valid, perhaps across the board.

When I was looking at the non-optimized code and didn't realize it wasn't
optimized, I noticed a couple of functions that could have been inlined.
Once I started trying to inline the assembly, I realized that the problem
is not knowing which register contains the value to be manipulated. If the
functions that require optimization were rewritten entirely in assembly,
that would eliminate the variable register concern, but would require a
lot of work.

Would a short-term solution be to force the code that runs with MSR_DR off
to be optimized? Additionally, is the Altivec-specific code fast enough to
warrant using it, even though checking whether Altivec can be used adds
overhead on non-Altivec CPUs like G3s? If Altivec were a kernel option
rather than enabled by default, the code could be confined within #ifdef
blocks, eliminating the runtime conditional across the board.
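
A minimal sketch of the compile-time idea, not the actual pmap_subr.c code.
The option name SKETCH_ALTIVEC, the function names, and the 4096-byte page
size are all assumptions made up for illustration; memset() stands in for
the real scalar and vector loops.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical kernel option; uncomment to select the vector path. */
/* #define SKETCH_ALTIVEC */

/* Stand-in for the plain (G3-safe) zeroing loop. */
static void
sketch_zero_page_scalar(void *page)
{
	memset(page, 0, SKETCH_PAGE_SIZE);
}

#ifdef SKETCH_ALTIVEC
/* Stand-in for the Altivec loop; only compiled when the option is set. */
static void
sketch_zero_page_vector(void *page)
{
	memset(page, 0, SKETCH_PAGE_SIZE);
}
#endif

/*
 * The choice is made by the preprocessor, so there is no runtime test
 * of a flag like pmap_use_altivec on every call. Returns the name of
 * the path that was compiled in.
 */
static const char *
sketch_zero_page(void *page)
{
	assert(page != NULL);
#ifdef SKETCH_ALTIVEC
	sketch_zero_page_vector(page);
	return "vector";
#else
	sketch_zero_page_scalar(page);
	return "scalar";
#endif
}
```

The trade-off is that a kernel built with the option set would not run on a
CPU without Altivec, which is why the runtime check exists in the first
place.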

>As Matt said, they're still there on the call stack. The principal use
>of the registers in DDB is for looking at the live registers after
>hitting a breakpoint and when stepping through code; that's a
>situation where DDB is invoked directly by the CPU's exception
>mechanics, rather than through trap(). What would be a useful change
>here is not to change DDB, but to make the "panic: trap" path dump
>more state - after all, DDB might not be there at all, and a register
>dump would be useful for post-mortem analysis.

At the risk of being redundant with my previous statements, I would like to
see better care taken within traps that lead to panics so that the
registers displayed are the registers at the time the exception occurred.
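
To illustrate what dumping "the registers at the time of the exception"
could look like, here is a rough sketch that prints from a saved frame
rather than from the live registers. The struct layout and the names
sketch_frame and sketch_dump_frame are hypothetical, not the kernel's
actual trapframe or panic code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical snapshot taken by the exception entry code. */
struct sketch_frame {
	uint32_t fixreg[32];	/* r0..r31 as saved at exception time */
	uint32_t srr0;		/* saved instruction address */
	uint32_t srr1;		/* saved MSR */
};

/*
 * Format the saved registers into buf, four per line.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int
sketch_dump_frame(const struct sketch_frame *f, char *buf, size_t len)
{
	size_t used = 0;
	int n;

	assert(f != NULL && buf != NULL && len > 0);
	memset(buf, 0, len);

	for (int i = 0; i < 32; i++) {
		n = snprintf(buf + used, len - used, "r%d=%08x%c", i,
		    (unsigned)f->fixreg[i], (i % 4 == 3) ? '\n' : ' ');
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* buffer too small */
		used += (size_t)n;
	}
	n = snprintf(buf + used, len - used, "srr0=%08x srr1=%08x\n",
	    (unsigned)f->srr0, (unsigned)f->srr1);
	if (n < 0 || (size_t)n >= len - used)
		return -1;
	used += (size_t)n;
	return (int)used;
}
```

Because the values come from the snapshot, whatever the panic path does to
r9 afterwards does not affect what gets printed.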

>> Beyond the few registers that are used for panic's output strings, why
>> would r9 get mucked before it could be recorded? That's really poor.
>
>It is recorded, just not where DDB is looking.

Understood (now). Is there a mechanism to force ddb to use a different
stack frame?

>> >instruction; it's a best-effort kind of thing. It may be a couple of
>> >instructions late, and I think that's happening here.
>>
>> Quoting Programming Environment Manual, 32 bit, page 6-8:
>>
>> For machine check exceptions, SRR0 holds either an instruction that
>> would have completed or some instruction following it that would have
>> completed if the exception had not occurred.
>
>See the MPC7450 manual, section 4.6.2, "Machine Check Exception":
>
>  A TEA indication on the bus can result from any load or store
>  operation initiated by the processor. In general, TEA is expected to
>  be used by a memory controller to indicate that a memory parity
>  error or an uncorrectable memory ECC error has occurred. Note that
>  the resulting machine check exception is imprecise and unordered
>  with respect to the instruction that originated the bus operation.

Yes, but there is an isync immediately preceding the code causing the
exception. Since the instruction that apparently causes the exception is
only a couple of instructions later, and the requirement is that the SRR0
value is the earliest possible instruction that could have caused it, there
is not a lot of room for it to be anything else. The loop in vzeropage
increments r9 after each pass, and since the value in r9 was on an aligned
address, I concluded that it was the first time that the stvx was called.
Had I known that r9 is not guaranteed to be what r9 was at the time of the
exception, I would have seen an attempt to access an invalid memory address.
All I could go on was the belief that it was the correct register value and
that no call chains needed to alter r9, as it is generally out of the range
of volatile registers.
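
The inference from r9 can be written down as a small calculation. This is
only a sketch of the reasoning, not the vzeropage code: it assumes the
register holds the current store address, that the loop starts on a page
boundary, and that it advances by a fixed stride. The 4096-byte page size,
the 16-byte stride in the usage note, and the function name are
assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/*
 * Given the address found in the register and the assumed per-pass
 * stride in bytes, return how many passes of the loop had already
 * completed. Zero means the fault would be on the very first store.
 */
static unsigned
sketch_passes_completed(uint32_t reg, uint32_t stride)
{
	assert(stride != 0 && SKETCH_PAGE_SIZE % stride == 0);
	return (reg & (SKETCH_PAGE_SIZE - 1)) / stride;
}
```

With a 16-byte stride, a page-aligned value such as 0x01234000 gives 0
passes, which is the "first stvx" conclusion; 0x01234020 would give 2. The
conclusion is only as good as the register value it starts from.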

>> However, as you didn't reply to the original post, I get your point that I
>> am wasting my time and that I should wait until the cathedral builders
>> think there is a problem.
>
>I'm sorry. There was no hidden message in the choice of what I replied
>to; it's just easier for me to reproduce the original bug myself and
>analyze it in a controlled setting than it is to work through the
>description of an analysis on a machine I don't have. Our goal here is
>the same: fixing the reported bugs.

I find first-hand that when someone in core makes the statement that NetBSD
isn't "building cathedrals," it lacks credibility. I am truly frustrated by
the lack of feedback from people with more knowledge than I have. When,
once again, I have done a lot of groundwork and someone else appears to
step in without acknowledging the work done by Timo and me, and also says
my analysis is flawed but doesn't seem to offer any different data, I'm
going to be testy. It comes off as building cathedrals. Instead of working
with me to improve my knowledge and save you time, I see it as "the clergy"
and "the congregation," and the congregation doesn't get heard but the
clergy hands down their edicts as if they occurred in a vacuum.

tim