port-i386: Re: pmap_zero

Subject: Re: pmap_zero_page problem
To: None <port-i386@netbsd.org, port-alpha@netbsd.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-i386
Date: 04/01/1999 23:37:29
On Thu, Apr 01, 1999 at 01:50:35PM -0500, Thor Lancelot Simon wrote:
> Cyrix 6x86 (non-MMX) 120MHz, Intel TX-chipset motherboard, 32MB 66MHz SDRAM.
> 
> 3Com 3c905B with "ex" driver, and a single Seagate Medallist 6.5GB UDMA IDE
> disk.  A bunch of other stuff was running -- samba, lpd, xfs, etc.  Amusingly,
> I thought the problem was LFS-related but actually the machine didn't have any
> LFS filesystems mounted, I'd forgotten that I'd switched the samba volume back
> to FFS before I started having problems.
> 
> I didn't tweak the buffer cache at all.

Mine is a 6x86 133MHz. It isn't MMX either, and I think it's an IBM.
80MB 60ns EDO RAM, Intel chipset.
1 IDE (Multiword DMA mode 1), 1 SCSI disk (NCR controller).
1 WD8013EBT Ethernet
20MB /tmp MFS, 20MB buffer cache.
LFS configured in kernel, but not mounted.

Here's what I've found so far:
I get "panic: pmap_zero_page: lock botch" when running a parallel make
in my src tree (which lives on the SCSI disk). A good way to get them
is to run a 'make -j4 clean'. It happens even if the tree is clean, and even if
all disks but /tmp are mounted read-only with the system in single-user mode
(no X, no swap configured). I get them with / and /usr
on either the IDE or SCSI disk (in the latter case, IDE is not used at all).
I occasionally get other symptoms of what I think is the same cause:
page fault traps, or random core dumps of cc1.

If I protect pmap_zero_page with splhigh()/splx(), I don't get the
pmap_zero_page panic, but still the page fault traps or core dumps.
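
For readers following along, the guard described above has roughly the
following shape. This is only a user-space sketch under stated assumptions:
splhigh() and splx() are simulated stand-ins for the kernel routines of the
same name, and zero_page_guarded(), window_busy, cur_ipl, SIM_PAGE_SIZE and
the IPL values are hypothetical names made up for illustration. It is not
the actual pmap code.

```c
#include <assert.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096   /* hypothetical page size for the sketch */
#define SIM_IPL_NONE  0      /* hypothetical "no interrupts blocked" level */
#define SIM_IPL_HIGH  15     /* hypothetical "all interrupts blocked" level */

static int cur_ipl = SIM_IPL_NONE;  /* simulated interrupt priority level */
static int window_busy = 0;         /* stands in for the "lock botch" check */

/* Simulated stand-in: raise the priority level, return the previous one. */
static int splhigh(void)
{
	int old = cur_ipl;

	cur_ipl = SIM_IPL_HIGH;
	return old;
}

/* Simulated stand-in: restore a previously saved priority level. */
static void splx(int s)
{
	cur_ipl = s;
}

/*
 * Zero one page with interrupts (notionally) blocked for the whole
 * critical section. Returns 0 on success, or -1 where a real kernel
 * would panic because the shared mapping window is already in use.
 */
static int zero_page_guarded(unsigned char *page)
{
	int s = splhigh();

	if (window_busy) {
		splx(s);
		return -1;
	}
	window_busy = 1;
	assert(cur_ipl == SIM_IPL_HIGH);  /* nothing may interrupt us here */
	memset(page, 0, SIM_PAGE_SIZE);
	window_busy = 0;
	splx(s);
	return 0;
}
```

A caller would hand zero_page_guarded() a SIM_PAGE_SIZE-byte buffer and
check for a 0 return. As noted above, a guard of this kind only hides the
"lock botch" panic; the page fault traps and core dumps remain.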

I tried setting the RAM speed to 70ns in the BIOS; it didn't change anything.
I tried swapping the 8MB and 32MB SIMMs; that didn't change anything either.


> 
> Of course, it was precisely when I went to try to write down a traceback that
> I stopped being able to make the problem happen -- but as I recall the traces
> I got before but didn't write down ran through the vnode pager (which kind of
> makes sense, since the problem was happening when the program exited, not
> while it was writing data).

What you're saying here is interesting:
for each pmap_zero_page panic, I get the same stack trace:
#0  0xf01cc2f5 in pmap_map ()
#1  0xf01c8d87 in cpu_reboot ()
#2  0xf01274f1 in panic ()
#3  0xf01cc32e in pmap_zero_page ()
#4  0xf01bd7a6 in uvm_pagezero ()
#5  0xf01b7df0 in uvm_fault ()
#6  0xf01cfa73 in trap ()
#7  0xf0100cf9 in calltrap ()
can not access 0xefbfd2e4, invalid translation (invalid PDE)
can not access 0xefbfd2e4, invalid translation (invalid PDE)
Cannot access memory at address 0xefbfd2e4.

Now, I just ran a 'make clean' in single-user mode, with no swap space configured.
No panic, but a hang. There was still keyboard echo, but ^C or ^Z didn't do anything
(sound familiar to the Alpha people? This is why this is posted to
port-alpha too).
Ctrl-Alt-Del got me into the debugger; the hung process was a sh.
Here's the stack trace, written by hand:
syscall (#272)
syscall at syscall+0x1f6
sys_getdents(fdf64c30,fdfb9f88,fdfb80,0,63040) at sys_getdents+0x45
vn_readdir(fdf740b4,66000,0,1000,fdfb9f34) at vn_readdir+0xbc
ufs_readdir(fdfb9ec8,fdfb9f88,fdfb4c30,fdfb9f80,fdfb9ec8) at ufs_readdir+0xa3
ffs_read(fdfb9e28,fdfb9f88,fdf92c30,fdfb9f80,4) at ffs_read+0x356
calltrap(fa66e00,200,fdfb9ee4,fdfb9f88,b) at calltrap+0xb
end(f2890010,ffff0010,66000,fa66e000,fdfb9d10) at 0xfdfb9bd0
trap (#6)
trap() at trap + 0x342
(interrupt, and then the stack from Xintr1 to ddb).

From there I called cpu_reboot(0x104) and got a core dump.
I will put it online tomorrow morning.

I suspect the trap in ffs_read() is not really normal ...

--
Manuel Bouyer <bouyer@antioche.eu.org>
--