Subject: panic: pv_unlink0 - do my eyes deceive me?
To: None <port-sparc@NetBSD.ORG>
From: Erik E. Fair <fair@clock.org>
List: port-sparc
Date: 05/10/1997 05:24:46
Synopsis: a 96M sun4m (SPARC LX) dies of "panic: pv_unlink0" periodically
(the period seems to depend on a matrix of variables, including kernel size
(the bigger, the better) and offered load).

I've been staring at kernel code, and at a kernel crash dump, and I see some
*very* weird behavior. Either gdb is lying to me, or there is a cache (or
something!) playing games with what I got in the crash dump.

First, the most recent traceback:

(kgdb) where
#0  0xf80fc1e4 in dumpsys () at ../../../../arch/sparc/sparc/machdep.c:766
#1  0xf80fbf98 in cpu_reboot (howto=256, user_boot_string=0x0)
    at ../../../../arch/sparc/sparc/machdep.c:676
#2  0xf802c53c in panic (fmt=0x0) at ../../../../kern/subr_prf.c:149
#3  0xf80fe804 in pv_unlink4m (pv=0xf81a3770, pm=0xf813b0d0, va=4171255808)
    at ../../../../arch/sparc/sparc/pmap.c:2356
#4  0xf8102210 in pmap_enk4m (pm=0xf813b0d0, va=4171255808, prot=7,
    wired=-132591616, pv=0xf81c9ab0, pteproto=4060830)
    at ../../../../arch/sparc/sparc/pmap.c:5375
#5  0xf8101f94 in pmap_enter4m (pm=0xf813b0d0, va=4171255808, pa=64970752,
    prot=7, wired=1) at ../../../../arch/sparc/sparc/pmap.c:5315
#6  0xf80d0d94 in vm_fault (map=0xf81ec108, vaddr=4171255808, fault_type=7,
    change_wiring=1) at ../../../../vm/vm_fault.c:826
#7  0xf80d0f44 in vm_fault_wire (map=0xf81ec108, start=4171227136,
    end=4171259904) at ../../../../vm/vm_fault.c:884
#8  0xf80d36f0 in vm_map_pageable (map=0xf81ec108, start=4171227136,
    end=4171259904, new_pageable=0) at ../../../../vm/vm_map.c:1337
#9  0xf80d20e0 in kmem_malloc (map=0xf81ec108, size=32768, canwait=0)
    at ../../../../vm/vm_kern.c:321
#10 0xf8021714 in malloc (size=32768, type=84, flags=0)
    at ../../../../kern/kern_malloc.c:145
#11 0xf81162f0 in sunos_sys_getdents (p=0xf88d0200, v=0xfc738f28,
    retval=0xfc738f20) at ../../../../compat/sunos/sunos_misc.c:456
#12 0xf8106460 in syscall (code=174, tf=0xfc738fb0, pc=268618388)
    at ../../../../arch/sparc/sparc/trap.c:1100


Now, note, if you will, the following logic:

malloc is called with "flags" = 0. According to sys/malloc.h:

#define M_WAITOK        0x0000

Later, in malloc, kmem_malloc is called with "canwait" set to zero.
However, the code that does this reads as follows:

va = (caddr_t) kmem_malloc(kmem_map, (vm_size_t)ctob(npg),
                                           !(flags & M_NOWAIT));

That last bit there is intended to reverse the sense of the argument,
because the argument to kmem_malloc is indeed reversed. But wait! That
means if the "flags" argument to malloc is zero, the "canwait" argument to
kmem_malloc must be non-zero ("true"), and vice versa. Yet the crash dump
shows them both as zero. 'Tis A Puzzlement.
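
To make the inversion concrete, here is a minimal sketch of that one
expression in isolation. M_WAITOK is as quoted above; the value 0x0001 for
M_NOWAIT is an assumption (check your own sys/malloc.h), and canwait_for()
and show_canwait() are made-up names for illustration only, not kernel
functions.

```c
#include <assert.h>
#include <stdio.h>

#define M_WAITOK 0x0000   /* as quoted from sys/malloc.h above */
#define M_NOWAIT 0x0001   /* assumed value; verify against your sys/malloc.h */

/* Mirrors the expression malloc() passes as kmem_malloc()'s third argument. */
int canwait_for(int flags)
{
    return !(flags & M_NOWAIT);
}

/* Print what "canwait" should be for each flag value. */
void show_canwait(void)
{
    /* flags == 0 (M_WAITOK) must yield a non-zero canwait. */
    assert(canwait_for(M_WAITOK) != 0);
    printf("flags=M_WAITOK -> canwait=%d\n", canwait_for(M_WAITOK));
    printf("flags=M_NOWAIT -> canwait=%d\n", canwait_for(M_NOWAIT));
}
```

Under those assumptions, flags=0 can only produce canwait=1, so the
"flags=0, canwait=0" pair in frames #9 and #10 cannot both be what was
actually passed.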

However, to confound matters even further, the very next subroutine on the
stack is vm_map_pageable, called by kmem_malloc, which can only be called
if "canwait" is non-zero! Paint me very confused: the right result from the
wrong arguments? (or did someone later in the call chain fiddle the stack?)

I'm sending this to port-sparc because it happens on this hardware (indeed,
appears to be specific to it), despite the fact that all the code
referenced so far in this missive is in the MI part of the kernel...

Anyone want to venture a guess as to what is going on here?

Further staring down the stack makes me interested in the changing value of
the argument "wired", which is an integer, simply passed down from routine
to routine, but whose value changes from "1" (sensible) to some weird
negative number along the way.
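
kgdb prints most of these arguments in decimal, which hides what they look
like as addresses. A small sketch that reprints the values from frames #4
through #7 in hex follows. The helper names are made up for illustration, and
the 4 KB page size is an assumption about the sun4m MMU.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KB pages on sun4m */

/* Reinterpret a signed 32-bit value, as kgdb printed it, as an unsigned word. */
uint32_t as_word(int32_t v)
{
    return (uint32_t)v;
}

/* Index of the page holding va, counted from the start of the range. */
uint32_t page_index(uint32_t start, uint32_t va)
{
    return (va - start) / SKETCH_PAGE_SIZE;
}

/* Reprint the decimal values from the traceback in hex. */
void dump_frame_values(void)
{
    uint32_t va    = 4171255808u;   /* frames #3 to #6 */
    uint32_t start = 4171227136u;   /* frames #7, #8 */
    uint32_t end   = 4171259904u;   /* frames #7, #8 */

    /* The wired range should match the 32768 bytes requested in frame #9. */
    assert(end - start == 32768u);

    printf("va    = 0x%08" PRIx32 "\n", va);
    printf("start = 0x%08" PRIx32 "\n", start);
    printf("end   = 0x%08" PRIx32 "\n", end);
    printf("page  = %" PRIu32 "\n", page_index(start, va));
    printf("wired = 0x%08" PRIx32 "\n", as_word(-132591616));
}
```

Seen this way, wired=-132591616 is 0xf818d000, which sits in the same
0xf81xxxxx range as the pm and pv pointers in the trace. That may hint that
the slot kgdb reads for "wired" no longer holds the original argument, but
that is a guess, not a diagnosis.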

Finally, I got down into pmap_enk4m and noticed a small thing: a little
ways before the reference to pv_unlink (the death of me), it calls
pmap_changeprot.

Shouldn't that be pmap_changeprot4m?

(The hell of this observation is that I'm running a 4M only kernel, so
likely the right thing is happening anyway with a macro substitution, but
the GENERIC kernel (and any other multiplatform kernel) is going to have
the wrong call made).
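
For anyone unfamiliar with the kind of macro substitution I mean, here is a
purely illustrative sketch. Every name in it is made up, and it is not copied
from pmap.h or pmap.c; it only shows how an unsuffixed call could end up at
the 4m routine in a single-MMU build and somewhere else otherwise.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the per-MMU routines; not the real pmap code. */
const char *sketch_changeprot4_4c(void) { return "4/4c"; }
const char *sketch_changeprot4m(void)   { return "4m"; }

/* Assumption: set to 1 to model a kernel configured for sun4m only. */
#define SKETCH_SUN4M_ONLY 1

#if SKETCH_SUN4M_ONLY
/* Single-MMU build: the unsuffixed name is just an alias for the 4m routine. */
#define sketch_changeprot sketch_changeprot4m
#else
/* Otherwise the unsuffixed name resolves to something else entirely. */
#define sketch_changeprot sketch_changeprot4_4c
#endif

/* Models a 4m-specific routine that calls the unsuffixed name. */
const char *sketch_called_from_enk4m(void)
{
    return sketch_changeprot();
}

/* Report which routine the unsuffixed call reached in this configuration. */
void sketch_report(void)
{
    const char *which = sketch_called_from_enk4m();
    assert(strlen(which) > 0);
    printf("unsuffixed call reached the %s routine\n", which);
}
```

With SKETCH_SUN4M_ONLY set to 1 the unsuffixed call lands in the 4m stand-in,
which is why a 4m-only kernel would hide the problem. How the real
multiplatform kernel resolves the unsuffixed name is exactly what needs
checking in the source.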

In addition, the diagnostic messages in the various pmap_* routines have
not been changed to reflect the new, hardware-specific names of the
routines that are actually being called (no doubt dating back to when these
were one set of routines that just handled 4/4c and not 4m), which could be
confusing to some poor schmoe like me, trying to debug this; diffs will be
send-pr'd tomorrow.

	Erik Fair <fair@clock.org>