Subject: kern/33278: lockup with uvm_fault sleeping on "flt_pmfail2" and pagedaemon running but not helping
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
From: Jed Davis <jld@panix.com>
List: netbsd-bugs
Date: 04/18/2006 02:55:01
>Number: 33278
>Category: kern
>Synopsis: A process repeatedly sleeps in uvm_fault on "flt_pmfail2"; the pagedaemon is repeatedly woken but does not help it; and nothing else can run.
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Apr 18 02:55:00 +0000 2006
>Originator: Jed Davis
>Release: NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD mailproc1.panix.com 3.0 NetBSD 3.0 (PANIX-APPLIANCE) #0: Wed Mar 22 20:57:32 EST 2006 root@trinity.nyc.access.net:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-APPLIANCE i386
Architecture: i386
Machine: i386
>Description:
The host in question runs 3.0/i386, is diskless, and has an NFS swap
file which is only rarely used. Earlier today it locked up (it answered
ping but did nothing else) in an interesting way. Breaking into ddb
repeatedly, or setting a breakpoint on ltsleep(), showed that it was
alternating between two things: a user command, which had taken a page
fault and kept sleeping with wait message "flt_pmfail2" (which appears
to happen only when the pmap_enter() that would resolve the fault
fails), and the pagedaemon, which was repeatedly woken from the sleep
at the top of the loop in uvm_pageout(), did something (it wasn't clear
what), then woke the user process and went back to sleep. Clearly the
pagedaemon wasn't fixing whatever the faulting process's problem was,
because the process kept failing the pmap_enter() and going back to
sleep.

The odd part is that the machine wasn't out of RAM: uvmexp reported
30054 pages free, and only 14709 of 131072 swap pages in use. Full
"show uvm" output was:
Current UVM status:
pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
250200 VM pages: 118266 active, 72662 inactive, 1856 wired, 30054 free
min 10% (25) anon, 5% (12) file, 5% (12) exec
max 90% (230) anon, 10% (25) file, 30% (76) exec
pages 171710 anon, 17940 file, 11042 exec
freemin=64, free-target=85, inactive-target=63642, wired-max=83400
faults=857377139, traps=1028644388, intrs=206793316, ctxswitch=594529707
softint=197099867, syscalls=-1658109657, swapins=343, swapouts=363
fault counts:
noram=21795, noanon=0, pgwait=28, pgrele=0
ok relocks(total)=4302(4326), anget(retrys)=210566639(3935), amapcopy=120469760
neighbor anon/obj pg=170692371/1524327373, gets(lock/unlock)=444367593/391
cases: anon=132483510, anoncow=69967129, obj=378986201, prcopy=65381150, przero=229724299
daemon and swap counts:
woke=30080168, revs=739, scans=1438018, obscans=137186, anscans=14713
busy=0, freed=151899, reactivate=540466, deactivate=1706185
pageouts=986, pending=0, nswget=3386
nswapdev=1, nanon=365567, nanonneeded=365567 nfreeanon=192581
swpages=131072, swpginuse=14709, swpgonly=11331 paging=0
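
For reference, here is roughly what the two sides of that handshake
look like. This is a paraphrase from memory of the netbsd-3
uvm_fault()/uvm_wait()/uvm_pageout() logic, not an exact quote, so
names, arguments, and locking details are approximate:

	/* uvm_fault(): entering the now-resident page into the pmap */
	if (pmap_enter(ufi.orig_map->pmap, ufi.orig_rvaddr,
	    VM_PAGE_TO_PHYS(pg), enter_prot,
	    access_type | PMAP_CANFAIL | (wired ? PMAP_WIRED : 0)) != 0) {
		/*
		 * pmap_enter() couldn't get resources (e.g. a page
		 * table page).  Unlock, let the pagedaemon run, and
		 * restart the fault from the top.
		 */
		uvmfault_unlockall(&ufi, amap, uobj, anon);
		uvm_wait("flt_pmfail2");	/* the sleep seen in ddb */
		goto ReFault;
	}

	/* uvm_wait(): wake the pagedaemon, then sleep on uvmexp.free */
	wakeup(&uvm.pagedaemon);
	UVM_UNLOCK_AND_WAIT(&uvmexp.free, &uvm.pagedaemon_lock,
	    FALSE, wmsg, timo);

	/* uvm_pageout(): the loop the pagedaemon was cycling through */
	for (;;) {
		UVM_UNLOCK_AND_WAIT(&uvm.pagedaemon,
		    &uvm.pagedaemon_lock, FALSE, "pgdaemon", 0);
		uvmexp.pdwoke++;
		/* ... scan and free pages ... */
		wakeup(&uvmexp.free);	/* wakes the faulting process */
	}

If pmap_enter() keeps failing for a reason that freeing more pages
can't fix -- which seems to be the case here, given the 30054 free
pages -- then the faulting process and the pagedaemon will ping-pong
like this indefinitely, which matches what I saw in ddb.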
>How-To-Repeat:
There's no obvious way to reproduce this, and it hasn't been occurring
often enough to be a real problem yet. In any case, since the host is
diskless, I can't get a crash dump. My hope is that I've gathered
enough information here that the problem might be visible by
inspection, at least to someone more familiar with this code.
>Fix:
Rebooting the box makes for a passable, if rather suboptimal, workaround.
r1.45 of uvm_bio.c was mentioned on a mailing list recently and looks
like it might be related, but I don't know whether it actually is.