Subject: Re: uvm_pagefaults in 1.5.3 & 1.6
To: Peter Bex <Peter.Bex@student.kun.nl>
From: None <don+nbsdtk@resun.com>
List: tech-kern
Date: 10/16/2002 03:13:03
>>>>> "Peter" == Peter Bex <Peter.Bex@student.kun.nl> writes:

Peter> I have exactly the same problem. My system is an AMD Athlon 1
Peter> GHz, with 512 Mb ram and the same amount of swap space.
Peter> Could it be an AMD-only problem?

Interesting.  I've seen it on both an AMD K6 and the Athlon 1800 (1.53GHz).

Peter> I've tried disabling UDMA, but it still happens. (haven't
Peter> tried disabling DMA yet)

That's consistent with the tests that I did yesterday.  I built a
system that only had a DMA drive in it, but the problem still occurred.

Peter> The problem can occur very early, but sometimes the system
Peter> will run for hours and then crash on some small disk-access.

I haven't managed to tie it to any particular disk access, but that
is certainly plausible.

Peter> In NetBSD 1.5 this never happened, so I guess the problem has
Peter> been introduced along the way to 1.6. (For the record, I'm
Peter> using 1.6 first release)

Or at least exacerbated by 1.6.  I saw the problem in 1.5.2, but
only when a particular software system was active.  That system does
a lot of disk I/O, so it may be related.  I also killed the 1.5.2
system just by compressing/uncompressing some core dump (512MB) files.
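
A minimal user-space sketch of that kind of compress/uncompress
stress, assuming Python with its standard zlib module is available on
the test box.  The helper name roundtrip_failures is hypothetical and
was not part of the tests described in this message.  It only reports
whether data survives a round trip; it cannot say which component is
at fault.

```python
import os
import zlib


def roundtrip_failures(rounds=4, size=1 << 20):
    """Compress and decompress random buffers; count rounds that fail.

    A round fails if zlib raises an error or if the restored bytes
    differ from the original buffer.
    """
    failures = 0
    for _ in range(rounds):
        data = os.urandom(size)  # fresh random buffer each round
        try:
            restored = zlib.decompress(zlib.compress(data))
        except zlib.error:
            failures += 1
            continue
        if restored != data:
            failures += 1
    return failures
```

On sound hardware the count should stay at zero no matter how many
rounds are run; any non-zero result would be worth chasing.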

Peter> I've disabled any drivers I'm not using, and the ones I _am_
Peter> using are the same as in NetBSD 1.5, with the exception of
Peter> the NTFS drivers (I moved from Win'95 to Win2000).

I'll include my notes from yesterday's test at the end.  I've broken
1.6 with no network drivers compiled into the kernel, and in single
user mode.

Peter> I mostly get pagefaults, but sometimes I get some really
Peter> strange error saying that the kernel is 'locking against
Peter> myself'.

I've seen both of those, both on 1.5.2 and on 1.6.

Peter> Also, sometimes a file is 'wrong'. Then I build and the
Peter> compiler complains about some error in a system header. Upon
Peter> investigation, nothing is wrong. When compiling again, it
Peter> simply gets past that stage.  So the filesystem is
Peter> temporarily corrupt, so to say (or the compiler is buggy).

Agreed.  It took me nine 'makes' to get through a kernel build.
Other times, it can do it the first time.  Compiling a kernel is my
current test case.
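
For anyone who wants to repeat the test, here is a minimal sketch of
how the "keep running make until it gets through" count could be
scripted, assuming Python is installed.  The name count_attempts and
the example command are illustrative assumptions, not what was
actually run for the notes below.

```python
import subprocess


def count_attempts(cmd, max_attempts=10):
    """Run cmd until it exits 0 or max_attempts is reached.

    Returns (attempt_number, exit_codes); attempt_number is None if
    the command never succeeded.  A negative exit code means the
    child was killed by that signal number (POSIX only).
    """
    codes = []
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        codes.append(result.returncode)
        if result.returncode == 0:
            return attempt, codes
    return None, codes


# Example (run from the kernel compile directory):
#   attempts, codes = count_attempts(["make"])
```

Keeping the list of exit codes makes it easier to compare runs than a
bare count.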

Peter> These errors happen when compiling pkgsrc or the kernel
Peter> itself.

Same here.

Peter> A kernel trace in the debugger shows different things
Peter> everytime. (ie everytime different things cause it)

That's also what I've seen.  Most, if not all, of the time, the
processes doing the kernel compile aren't the ones in the trace.

Peter> If it matters: I'm not using any funky things, not even X,
Peter> when this happens.  Just a default install, with no extremely
Peter> special daemons. (for a ps list, just ask me)

I'm breaking 1.6 in single-user, without any network drivers built
into the kernel, running directly from the console (no X, etc.)

Peter> I'll try getting current one of the next days, and I'll
Peter> report if it's still happening.  But it is really
Peter> frustrating, since 1.6 is supposed to be stable.

I *really* appreciate your report.  I've spent many days trying to
isolate the problem to a piece of HW, somewhere.  Your observation
that it might be an AMD thing seems consistent.  And like you say,
the system can run for days without a problem.  Certain activities
seem to expose the fault.

Here are the notes from yesterday's testing:
====
10/15/02 12:00 PM

Reconfigured pulsar.  Removed all cards.  Turned off on-board
networking.  Booted into single user and successfully and cleanly
(first compile.  Took seven, before.) generated (from config, make
depend & make) a nonet kernel, without rebooting in between.  First
time was clean.  Second time 'program cc1 got fatal signal 11'.
Third time, same error as the second, different file being
compiled.  Fourth time was clean.  Power down.  Put skin back on.
On power-up, CPU @ 43 degrees C.  System at 82 degrees F.  Sit until
temps stable at CPU 49C, System 30C(86F) for five minutes.  Boot to
single user nonet. Regen nonet. 

Got:

uvm_fault(0xe2f43468, 0x0, 0, 1) -> e
kernel: page fault trap, code=0
Stopped in pid 1379 (cpp0) at getinoquota+0xf: movl 0x34(%esi),%eax

't' (trace) produced:
getinoquota(e2f77694,c0c2cf00,e30a6cf0,c01f66c2) at getinoquota+0xf
ufs_inactive(e30a6d14,10002,e2f66b4c,e30a6d40,e2f66b4c) at ufs_inactive+0x8f
VOP_INACTIVE(e2f66b4c,e2f5f734,e30a6d50,0) at VOP_INACTIVE+0x2e
vrele(e2f66b4c,e30a6ef0,e30a6f04,c01f69ff) at vrele+0xa7
lookup(e30a6ee0,e2f54000,400,e30a6ef8,0) at lookup+0x4a3
namei(e30a6ee0,0,e30a6e40,c02987cf,0) at namei+0x297
vn_open(e30a6ee0,1,0,ffffffff,e30a6f80) at vn_open+0x172
sys_open(e2f5f734,e30a6f80,e30a6f78,c02ad4bc) at sys_open+0xbe
syscall_plain(1f,1f,1f,1f,0) at syscall_plain+0xa7

Sync produced:

Syncing disks... 19 1 done (drive had to spin up)

panic: lockmgr: draining against myself
Stopped in pid 1379 (cpp0) at cpu_Debugger+0x4: leave

Next sync produced dump.

At reboot, CPU @ 49C; sys @30C.  Boot to single user nonet.  Fsck
partitions.  Savecore run.  Dump #4.  Regen nonet.  First attempt
gets:

/var/tmp/ccmh4zgc.s: Assembler messages:
/var/tmp/ccmh4zgc.s:1309: Fatal error: Case value 252 unexpected at line 896 of file
"/autobuild/src/gnu/usr.bin/binutils/gas/../../../dist/toolchain/gas/symbols.c"

*** Error code 1

Stop.

Second attempt got:

cc: Internal compiler error: program cc1 got fatal signal 11

Eliminate DIMM1:

Power off and swap out memory in DIMM 1.  Wait for temps to
stabilize at CPU 49C; System 30C.  Boot into single user nonet.
Regen nonet.  First attempt fails with program cc1 got fatal signal 11.

Eliminate DIMM2:

Power-off and swap out memory in DIMM 2.  Wait for temps to stabilize
at CPU 49C; System 30C.  Boot into single user nonet.  Regen nonet.
System hangs.

Eliminate locally generated kernel:
Boot to single user GENERIC kernel from distribution.  Regen nonet.
System hangs.

Eliminate HD:
Staying on GENERIC kernel.  Add other HD to system.  Dump/restore to
that HD {/,/usr,/home}.  Dumps complete without a failure (Reduces
the chance of a cable failure being involved?).  Drop original HD.
Temps @ CPU 49C; System 31C.  Reboot to single-user GENERIC.  Regen
nonet.  First attempt fails with a segmentation fault.
====

Regards,
-- 
  Don Phillips         don@resun.com
  Escondido, Calif.    My opinions are just that, and no more.