Subject: kern/33278: lockup with uvm_fault sleeping on "flt_pmfail2" and pagedaemon running but not helping
To: kern-bug-people@netbsd.org, gnats-admin@netbsd.org
From: Jed Davis <jld@panix.com>
List: netbsd-bugs
Date: 04/18/2006 02:55:01
>Number: 33278
>Category: kern
>Synopsis: A process repeatedly sleeps in uvm_fault on "flt_pmfail2"; the pagedaemon is repeatedly woken but does not help it; and nothing else can run.
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Apr 18 02:55:00 +0000 2006
>Originator: Jed Davis
>Release: NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD mailproc1.panix.com 3.0 NetBSD 3.0 (PANIX-APPLIANCE) #0: Wed Mar 22 20:57:32 EST 2006 root@trinity.nyc.access.net:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-APPLIANCE i386
Architecture: i386
Machine: i386
>Description:
The host in question runs 3.0/i386, is diskless, and has an NFS swap
file which is only rarely used. Earlier today it locked up (it answered
ping but did nothing else) in an interesting way. Breaking into ddb
repeatedly, or setting a breakpoint on ltsleep(), showed that it was
alternating between two things: a user command, which had taken a page
fault and kept sleeping with wait message "flt_pmfail2" (which appears
to happen only when the pmap_enter() that would resolve the fault
fails), and the pagedaemon, which was repeatedly woken from the sleep
at the top of the loop in uvm_pageout(), did something (it wasn't clear
what), then woke the user process and went back to sleep. Clearly the
pagedaemon wasn't fixing whatever the faulting process's problem was,
because the process kept failing the pmap_enter() and going back to
sleep.

The odd part is that the machine wasn't out of RAM: uvmexp reported
30054 pages free, and only 14709 of 131072 swap pages in use. Full
"show uvm" output was:
Current UVM status:
pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
250200 VM pages: 118266 active, 72662 inactive, 1856 wired, 30054 free
min 10% (25) anon, 5% (12) file, 5% (12) exec
max 90% (230) anon, 10% (25) file, 30% (76) exec
pages 171710 anon, 17940 file, 11042 exec
freemin=64, free-target=85, inactive-target=63642, wired-max=83400
faults=857377139, traps=1028644388, intrs=206793316, ctxswitch=594529707
softint=197099867, syscalls=-1658109657, swapins=343, swapouts=363
fault counts:
noram=21795, noanon=0, pgwait=28, pgrele=0
ok relocks(total)=4302(4326), anget(retrys)=210566639(3935), amapcopy=120469760
neighbor anon/obj pg=170692371/1524327373, gets(lock/unlock)=444367593/391
cases: anon=132483510, anoncow=69967129, obj=378986201, prcopy=65381150, przero=229724299
daemon and swap counts:
woke=30080168, revs=739, scans=1438018, obscans=137186, anscans=14713
busy=0, freed=151899, reactivate=540466, deactivate=1706185
pageouts=986, pending=0, nswget=3386
nswapdev=1, nanon=365567, nanonneeded=365567 nfreeanon=192581
swpages=131072, swpginuse=14709, swpgonly=11331 paging=0
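
For reference, here is roughly what the two sides of that handshake
look like. This is a paraphrase from memory of the netbsd-3
uvm_fault()/uvm_wait()/uvm_pageout() logic, not an exact quote, so
names, arguments, and locking details are approximate:

	/* uvm_fault(): entering the now-resident page into the pmap */
	if (pmap_enter(ufi.orig_map->pmap, ufi.orig_rvaddr,
	    VM_PAGE_TO_PHYS(pg), enter_prot,
	    access_type | PMAP_CANFAIL | (wired ? PMAP_WIRED : 0)) != 0) {
		/*
		 * pmap_enter() couldn't get resources (e.g. a page
		 * table page).  Unlock, let the pagedaemon run, and
		 * restart the fault from the top.
		 */
		uvmfault_unlockall(&ufi, amap, uobj, anon);
		uvm_wait("flt_pmfail2");	/* the sleep seen in ddb */
		goto ReFault;
	}

	/* uvm_wait(): wake the pagedaemon, then sleep on uvmexp.free */
	wakeup(&uvm.pagedaemon);
	UVM_UNLOCK_AND_WAIT(&uvmexp.free, &uvm.pagedaemon_lock,
	    FALSE, wmsg, timo);

	/* uvm_pageout(): the loop the pagedaemon was cycling through */
	for (;;) {
		UVM_UNLOCK_AND_WAIT(&uvm.pagedaemon,
		    &uvm.pagedaemon_lock, FALSE, "pgdaemon", 0);
		uvmexp.pdwoke++;
		/* ... scan and free pages ... */
		wakeup(&uvmexp.free);	/* wakes the faulting process */
	}

If pmap_enter() keeps failing for a reason that freeing more pages
can't fix -- which seems to be the case here, given the 30054 free
pages -- then the faulting process and the pagedaemon will ping-pong
like this indefinitely, which matches what I saw in ddb.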
>How-To-Repeat:
There's no obvious way to reproduce this, and it hasn't been occurring
often enough to be a real problem yet. In any case, since the host is
diskless, I can't get a crash dump. My hope is that I've gathered
enough information here that the problem might be visible by
inspection, at least to someone more familiar with this code.
>Fix:
Rebooting the box makes for a passable, if rather suboptimal, workaround.
r1.45 of uvm_bio.c was mentioned on a mailing list recently and looks
like it might be related, but I don't know whether it actually is.