Subject: Re: request for testers of PDPOLICY_CLOCKPRO
To: None <current-users@NetBSD.org>
From: Brian de Alwis <bsd@cs.ubc.ca>
List: current-users
Date: 03/02/2007 11:07:50
I've continued to use PDPOLICY_CLOCKPRO for several weeks and have
noticed some further anomalous behaviour in the last couple of days.
I should note that this is with a kernel from the day before the
newlock2 commits (I couldn't afford for this machine to be out of
whack).


I can only describe the symptoms of my first issue; I haven't been
able to trace it to a cause.  I've been doing some memory-heavy Java
work using Eclipse, along with lots of database activity using Apache
Derby (a Java DB).  I recently noticed that killing or exiting a Java
process with a large memory footprint (e.g., 100-200MB) sometimes
makes the system appear to freeze.  There is no disk activity.  It
usually recovers after a short time (anywhere from a few seconds to
tens of seconds), and I've sometimes noticed a pms0 reset message on
/dev/console shortly after it revives:

    pms0: resetting mouse interface

I also experienced some *very* weird issues under X11 last night,
possibly related to this mouse problem, where the system no longer
tracked focus properly.  The keyboard still worked, but the mouse
didn't seem to.

I have vm.coldtargetpct set to 40, and the file cache looked fine
(about 140MB).


As for the second issue, I've had about five panics since my last
report.  Unfortunately I haven't been able to trace them, as something
is mucking up the stack:

    (gdb) bt
    #0  0xc04eb04c in cpu_reboot (howto=0, bootstr=0x0)
	at /usr/src/sys/arch/i386/i386/machdep.c:910
    #1  0xc0454d58 in panic (fmt=0xc08f6cac "trap")
	at /usr/src/sys/kern/subr_prf.c:246
    #2  0xc04f70de in trap (frame=0xccc955a8)
	at /usr/src/sys/arch/i386/i386/trap.c:336
    #3  0xc010c4b2 in calltrap ()
    #4  0xc04e7300 in db_read_bytes (addr=6, size=4, 
	data=0xccc95614 "\bWÉÌ\bWÉÌ\004")
	at /usr/src/sys/arch/i386/i386/db_memrw.c:98
    #5  0xc0198e23 in db_get_value (addr=6, size=4, is_signed=0)
	at /usr/src/sys/ddb/db_access.c:62
    #6  0xc04e7afd in db_stack_trace_print (addr=-859220216, have_addr=1, 
	count=65535, modif=0xc0918ce9 "", pr=0xc0454b40 <printf>)
	at /usr/src/sys/arch/i386/i386/db_trace.c:467
    #7  0xc0454d2f in panic (fmt=0xc08f6cac "trap")
	at /usr/src/sys/kern/subr_prf.c:235
    #8  0xc04f70de in trap (frame=0xccc957ac)
	at /usr/src/sys/arch/i386/i386/trap.c:336
    #9  0xc010c4b2 in calltrap ()
    #10 0xc07568f4 in memset ()
    Previous frame inner to this frame (corrupt stack?)

I was in X, so I wasn't able to see the panic message, and it wasn't
in /var/log/messages.  But judging from the trap() frame in the core,
the message would have been something like:

    fatal page fault in supervisor mode
    trap type 6 code 2 eip c07568f4 cs 8 eflags 10293 

I unfortunately can't get the cr2 and ilevel values.

Going to the memset frame:

    (gdb) info fr
    Stack level 10, frame at 0xccc957ac:
     eip = 0xc07568f4 in memset; saved eip 0xccc957ac
     caller of frame at 0xccc957b0
     Arglist at 0xccc957a4, args: 
     Locals at 0xccc957a4, Previous frame's sp is 0xccc957ac
     Saved registers:
      eip at 0xccc957a8


Unfortunately, the first behaviour is too frustrating to live with, so
I've reverted to a kernel with the normal paging policy.  I'll see
whether that exhibits the same problems.

Brian.

-- 
  Brian de Alwis | Software Practices Lab | UBC | http://www.cs.ubc.ca/~bsd/
      "Amusement to an observing mind is study." - Benjamin Disraeli