kern/39173: random processes get stuck in "D" state

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/39173: random processes get stuck in "D" state
From: blymn%internode.on.net@localhost
Date: Sun, 20 Jul 2008 10:40:00 +0000 (UTC)

>Number:         39173
>Category:       kern
>Synopsis:       processes doing a lot of file i/o get stuck in "D" state
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jul 20 10:40:00 +0000 2008
>Originator:     Brett Lymn (Master of the Siren)
>Release:        NetBSD 4.99.70 cvs updated 20080720
>Organization:
Brett Lymn
>Environment:
NetBSD rover 4.99.70 NetBSD 4.99.70 (ROVER2) #100: Sun Jul 20 18:22:54 CST 2008 
 toor@rover:/usr/src/sys/arch/i386/compile/ROVER2 i386

Architecture: i386
Machine: i386
>Description:
        I am not sure when this started happening, as my laptop had not been
updated for a long time.  The laptop has an Intel Core Duo processor and
4GB of RAM; I have configured the kernel with MP support and both
processors are operational.

What I am seeing is that processes doing a lot of file i/o randomly enter
the "D" state (short kernel wait) and never return.  The bug can be
elusive: sometimes it will not surface for a long time, allowing lengthy
builds from pkgsrc to run to completion; other times it will surface almost
straight away.  Normally the processes that get stuck are either as(1) or
ld(1), but I have seen it happen to cp, rm and firefox3.

Using ddb, ps shows that the stuck process is normally waiting on the
"uvn_fp2" wait channel.  The wait channel address for this tag is actually
the page being waited for; doing a show page on that address tells me that
the page flags are <TABLED,CLEAN> (note no WANTED).

I made some modifications to the UVMTRKOWN code (uvm_page_own() and friends)
to record where the page was last owned, what the previous tag was and
where the page was unbusied (if this is in uvm_page_unbusy() then I note
the caller of that function too).  With this modification I have found that,
in every instance I have investigated, the last unbusy of the page was done
by uvm_aio_aiodone_pages() calling uvm_page_unbusy().  This is not to say
that uvm_aio_aiodone_pages() is the culprit here - I think it is just the
victim.  I suspect that an aio operation is scheduled to bring in the page
but, before uvm_aio_aiodone_pages() does its work, *something* sneaks in
and clears PG_WANTED so that the wakeup() on the page is never done.  I
should be able to test this by adding more code to note whether the page
had PG_WANTED set when it was last unbusied.
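
To make the suspected sequence concrete, here is a minimal user-space
sketch in C.  It is not the UVM code: every sim_* name, the flag values
and the struct are made up for illustration, and it assumes that unbusy
only issues a wakeup when the wanted flag is still set.  It also records
whether the flag was set at the last unbusy, which is the extra tracking
mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; the real ones live in the UVM headers. */
#define SIM_PG_BUSY   0x01u
#define SIM_PG_WANTED 0x02u

/* Hypothetical stand-in for a page, plus the extra tracking fields. */
struct sim_page {
	unsigned flags;
	bool wanted_at_unbusy;	/* was WANTED set when last unbusied? */
	int wakeups;		/* number of wakeups issued on this page */
};

/* A waiter finds the page busy, marks it wanted and would then sleep. */
static void
sim_wait_busy(struct sim_page *pg)
{
	if (pg->flags & SIM_PG_BUSY)
		pg->flags |= SIM_PG_WANTED;
}

/* The hypothetical "something" that clears WANTED behind the waiter. */
static void
sim_clear_wanted(struct sim_page *pg)
{
	pg->flags &= ~SIM_PG_WANTED;
}

/* Unbusy the page: wake the waiter only if WANTED is still set. */
static void
sim_unbusy(struct sim_page *pg)
{
	assert(pg->flags & SIM_PG_BUSY);
	pg->wanted_at_unbusy = (pg->flags & SIM_PG_WANTED) != 0;
	if (pg->wanted_at_unbusy)
		pg->wakeups++;
	pg->flags &= ~(SIM_PG_BUSY | SIM_PG_WANTED);
}

/* Run one scenario; returns true if the waiter would have been woken. */
static bool
sim_run(bool interloper)
{
	struct sim_page pg = { SIM_PG_BUSY, false, 0 };

	sim_wait_busy(&pg);
	if (interloper)
		sim_clear_wanted(&pg);
	sim_unbusy(&pg);
	return pg.wakeups > 0;
}
```

In this model sim_run(false) wakes the waiter, while sim_run(true) leaves
it asleep on a page that is neither busy nor wanted, which would match
what show page reports for the stuck processes.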

I have been over the locking of the pages a few times and checked all the
places where PG_WANTED is manipulated, but the locking all looks consistent
to me.  In places where I was not sure, I put extra locking in.  I also
added a few more calls to wakeup() around the code, but all this did was
shift which wait channel tag the process wedged on.  Maybe I have missed a
locking botch somewhere or maybe I am chasing the wrong thing - not sure.


>How-To-Repeat:
        The most reliable way for me is to start a pkgsrc build of something
with a lot of dependencies and a lot of source code.  Sometimes it takes
a long time or a few tries before something wedges; other times it
happens very quickly.

>Fix:
        Unknown at the moment.
