port-powerpc: Re: Followup: MCHK exception in -current with MMU off

Subject: Re: Followup: MCHK exception in -current with MMU off
To: Matt Thomas <matt@3am-software.com>
From: Tim Kelly <hockey@dialectronics.com>
List: port-powerpc
Date: 04/11/2005 13:35:24

At 9:45 AM -0700 4/11/05, Matt Thomas wrote:
>Tim Kelly wrote:
>> At 4:08 AM -0700 4/11/05, Matt Thomas wrote:
>>
>>>You have to fetch the registers from the trapframe, the address of
>>>which ddb prints for you
>>>
>>>0xd521a900: kernel MCHK trap by vzeropage+0x88: srr1=0x2041020
>>>           r1=0xd521a9c0 cr=0x40424088 xer=0 ctr=0x2927dc
>>>
>>>In the above case, registers 0-31 are at 0xd521a908, starting with
>>>register 0.  Look at <powerpc/frame.h> to see the layout of a
>>>trapframe.
>>
>>
>> And what is the instruction in ddb to switch to this stack frame? Just to
>> be clear - panics within traps leave the debugger one (?) stack frame off?
>> I'm basing this on the address you list being eight bytes off the address
>> listed by ddb.
>
>Sigh.  The reason it's 8 bytes off is that there is the standard space
>reserved at the start of the frame for the next frame's need to save
>the SP (R1) and LR (return address).  The trapframe follows immediately
>after those 8 bytes.

I was not able to glean this from the header file you suggested, although
that may be my fault. Thank you for clarifying this.
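
For anyone following along, the arithmetic described above can be sketched
in C. This is only a minimal sketch: it assumes 4-byte registers (as on
32-bit macppc) stored consecutively from r0, which is what the quoted
explanation implies but is not copied from <powerpc/frame.h>. The names
gpr_addr, LINKAGE_BYTES and REG_BYTES are made up for illustration.

```c
#include <stdint.h>

/* Assumption: 8 bytes reserved at the start of the frame for the saved
 * SP (r1) and LR, with the trapframe following immediately after. */
#define LINKAGE_BYTES 8u
/* Assumption: each general-purpose register occupies 4 bytes. */
#define REG_BYTES 4u

/* Given the frame address ddb prints with the trap message, return the
 * address where general-purpose register n (0..31) was saved. */
static uint32_t gpr_addr(uint32_t ddb_frame, unsigned n)
{
    return ddb_frame + LINKAGE_BYTES + n * REG_BYTES;
}
```

With the frame address from the trap message above, gpr_addr(0xd521a900, 0)
gives 0xd521a908, matching the address Matt quotes for register 0, and the
result can be handed to ddb's 'x' command to read the saved value.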

>ddb doesn't have the notion of up/down like gdb.  All you get is 'x'.
>
>> How many traps have panics as a final result? Or more precisely, how many
>> panics are caused as a result of trap?
>
>Literally, *everything* happens due to a trap/exception.  The kernel
>exists to do userland's bidding.  The userland indicates it wants
>something to happen via exceptions like system call (SC), data fault
>(DSI), instruction fault (ISI), program fault (PGM), etc.
>
>Eventually all panics are indirectly or directly due to exceptions.
>The machine check (MCHK) is one of those panics directly due to an
>exception.

There really needs to be better bookkeeping on this, not because there is
something wrong with the current implementation but because in every single
reported panic that I've seen on macppc that included registers, they came
from the frame ddb expects. Since in many, many cases the problem is not
repeatable by others on the list, all we have to go by is what ddb gives
us. With MCHK exceptions in particular, which at least one list reader
hits at random and all too often, it is next to impossible to accurately
isolate the problem without the correct register values, and unless the
user knows to dump the previous stack frame, anyone trying to help is
SOL.

>> When I posted my original analysis and stated that it seemed odd to me that
>> r9 was in the middle of kernel physical memory, and I asked if anyone had
>> any additional information to help isolate this, I was referring to the
>> kind of information above. Otherwise I could have seen that r9 was in fact
>> not a valid physical address and that the conditions for a MCHK exception
>> were being met.
>
>And how is a reader to know that you didn't know that?  Until your
>reply to Nathan Williams indicated an errant assumption, I assumed
>you knew it.

A valid point, although I'm not aware that it is expected to be common
knowledge that ddb will not display the correct register values under some
conditions. I'll even suggest that in the panic message a reminder to view
the correct registers should be included. Not that I'm a fan of OpenBSD,
but I do admit their panic message is quite clear about what is expected of
someone reporting the problem.

>> Gee, this whole scenario sounds so familiar. I do the groundwork for
>> resolving the problem, I miss a detail that someone with more knowledge
>> than I have checking behind me could have found, except apparently my
>> efforts aren't actually reviewed.
>
>This is a volunteer project.  No one is obligated to help.  Until
>you demonstrate a lack of prowess and/or someone has inclination
>to help, assume you are on your own.  It's not a nice assumption but
>it's an accurate one.

Odd way to phrase it - "lack of prowess." If I sort of know what I'm doing,
I'm on my own, but if I'm a newbie it'll get attention? It shouldn't work
this way, and I'm not advocating snobbery, either. People that know
something should attempt to help those that know enough to ask the
questions, and that's a sliding scale. I don't think it is fair to ask you
or Nathan to help out people who need instructions on the OF commands to
boot macppc, and I think it is reasonable that neither of you do. There are
plenty of people that can help with something like this, and occasionally
those people ask me if I know something about the problem, if it isn't
easily solvable. If I see a problem posted that isn't going to be a ground
level question, like Timo's, I try to see if I can help out. When I get to
the point that I've reached the limits of my knowledge and I post about it,
that is when it'd be nice if someone with more knowledge offered some
assistance, instead of silence - until Nathan did the same work
independently (I did get one offline response).

The alternative is that people like myself who are willing to examine
middle level problems stop doing so, and a large gap opens between
the people using NetBSD and the people developing NetBSD.

If NetBSD wants growth and longevity, it can't be starving and/or eating
its young.

tim