Subject: Re: pmap woes
To: None <mcr@sandelman.ocunix.on.ca>
From: Gordon W. Ross <gwr@mc.com>
List: port-sun3
Date: 06/18/1996 10:09:56
> Date: Mon, 17 Jun 1996 23:12:55 -0400
> From: Michael Richardson <mcr@sandelman.ocunix.on.ca>

>   So, I couldn't sleep last night and spent several hours trying to
> put my machine into a bad state so I could debug stuff.
>   I made some headway. I wasn't very successful at discovering
> precisely which page was bad.... Just now, things screwed up again, in
> a way that "more" dies, so I hit 'pmap_debug' and captured...
>   gdb told me that 0x20020047 was the PC when things were dying.
>   Search down for 'set_pte' --- that is where things die. I note that
> set_pte is not called at all until that point...

That looks like the address where some shared library is mapped.
I would try to figure out how it got there.  The best way I've found
to locate "wrong page mapping" has been to put a kernel breakpoint
in trapsignal, and look at the virtual address passed as the "code."
Then I look at the user-mode registers (use ddb to print the first
part of the PCB at p_addr for the current process) and look back
in the user-mode stack to find the first call to a shared library
function.  Then, I disassemble forward from there (follow calls)
until I see where the shared library stub jumps off into space.
That typically happens because the page that should have had data
that was fixed up by ld.so instead contains garbage. (i.e. instead
of pointers to shared library functions, you find strings!)
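
A rough sketch of the "strings instead of pointers" check, in case it
helps when eyeballing a dump.  Everything named here is hypothetical
(the helper names and the heuristic itself are mine, not anything in
ld.so or the kernel): it just counts 32-bit words whose four bytes are
all printable ASCII, which a table of function pointers should
essentially never contain.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical heuristic: a word "looks like text" if every one of its
 * four bytes is printable ASCII.  Checking byte by byte keeps the
 * result independent of host byte order.
 */
static int
sketch_word_looks_like_text(uint32_t w)
{
	int i;

	for (i = 0; i < 4; i++) {
		unsigned char c = (unsigned char)((w >> (8 * i)) & 0xff);
		if (!isprint(c))
			return 0;
	}
	return 1;
}

/* Count the suspicious words in a table copied out of the bad page. */
static size_t
sketch_count_text_words(const uint32_t *tbl, size_t nwords)
{
	size_t i, hits = 0;

	for (i = 0; i < nwords; i++)
		if (sketch_word_looks_like_text(tbl[i]))
			hits++;
	return hits;
}
```

If most of the words in what should be a jump table come back as
"text", the page very likely holds the wrong contents.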

It is not trivial to locate the page with a bad mapping, but I've
found that under controlled conditions, the same VA will have the
incorrect mappings almost every time.  When I observe the early,
"getty coredump" problem, the bad mapping is at 0x20a2000.  That
address should contain the 2nd page of libutil.so but has junk.
I've tried to figure out why this happens using pmap_db_watchva:
  db> w pmap_db_watchva 0x20a2000
  _pmap_db_watchva                0xffffffff      =       0x20a2000
but there appears to be a lot of activity at that VA, so there
was not an obvious "bad guy" in the resultant trace...
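
For anyone unfamiliar with the idea, a watch of this kind boils down
to a page-granular compare.  The following is only a minimal sketch
under my own assumptions, not the actual pmap code: the names are
made up, 8K pages are assumed, and 0xffffffff is treated as "no watch
set" because that is the old value ddb printed above.

```c
#include <stdint.h>

#define SKETCH_PGSHIFT	13		/* assumption: 8K pages */
#define SKETCH_NOWATCH	0xffffffffu	/* assumption: "disabled" value */

/* Stand-in for the watch variable; set it the way ddb's "w" would. */
static uint32_t sketch_watchva = SKETCH_NOWATCH;

/*
 * Return nonzero if va falls in the same page as the watched address.
 * A real hook would print a trace line here instead of returning.
 */
static int
sketch_watch_hit(uint32_t va)
{
	if (sketch_watchva == SKETCH_NOWATCH)
		return 0;
	return (va >> SKETCH_PGSHIFT) == (sketch_watchva >> SKETCH_PGSHIFT);
}
```

Because the compare is per page, every mapping operation touching
that page reports a hit, which is why the trace can get noisy.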

Another thing that helps for reproducing this bug is to whack the
code that determines the memory size (sun3_startup.c:378) so it
just slams a memory size of 0x3FC000 (4MB-16K) to make memory
more scarce.
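
As an illustration only (this is not the code at that line, and the
names are hypothetical), the hack amounts to something like the
following.  This variant clamps rather than unconditionally
overwriting, so a machine with less memory is left alone.

```c
#include <stdint.h>

/* 4MB minus 16K: 0x400000 - 0x4000 = 0x3FC000 */
#define SKETCH_MEM_LIMIT	0x3FC000u

/*
 * Hypothetical helper: cap the detected memory size so that the
 * kernel behaves as if it were on a small machine.
 */
static uint32_t
sketch_clamp_memsize(uint32_t detected)
{
	return (detected > SKETCH_MEM_LIMIT) ? SKETCH_MEM_LIMIT : detected;
}
```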

I have some ideas about how to detect unwanted recursion in the
pmap code.  The fact that this problem appears more frequently
with interrupts enabled in the si driver makes me suspect that
there may be a "missing protection" somewhere in the pmap code.
I'll pursue that...
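
One possible shape for such a recursion detector, as a sketch only:
a depth counter bumped on entry to each pmap routine and dropped on
exit, so that entering with a nonzero depth flags re-entry (for
example from an interrupt).  All names here are hypothetical, and the
counter is not updated atomically, so a real version would need
thought about where it sits relative to the interrupt masking.

```c
/* Hypothetical bookkeeping, not part of the existing pmap code. */
static volatile int sketch_pmap_depth;	/* current nesting depth */
static int sketch_pmap_reentries;	/* times we entered while busy */

/*
 * Call at the top of a pmap routine.  Returns nonzero if some pmap
 * routine was already active, i.e. unwanted recursion was detected.
 */
static int
sketch_pmap_guard_enter(void)
{
	int reentered = (sketch_pmap_depth != 0);

	if (reentered)
		sketch_pmap_reentries++;
	sketch_pmap_depth++;
	return reentered;
}

/* Call on every exit path of the same routine. */
static void
sketch_pmap_guard_exit(void)
{
	sketch_pmap_depth--;
}
```

A kernel version would presumably drop into ddb or panic where this
one just counts.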

Gordon