Subject: Re: pagedeamon deadlocks (2)
To: Gilbert Fernandes <email@example.com>
From: Daniel Carosone <firstname.lastname@example.org>
Date: 02/19/2004 18:07:49
Content-Type: text/plain; charset=us-ascii
On Thu, Feb 19, 2004 at 06:15:34AM +0000, Gilbert Fernandes wrote:
> there's been a fix two weeks ago to a part of the page daemon because
> it was counting in bytes and freeing in buffers
This is not directly relevant (i.e., the arithmetic error isn't the
*cause* of the issue), either way.
What has changed is the relative sizing of various things that compete
for memory, especially with the new buffer cache code as a new
contender (previously it was a fixed size).
The deadlock condition was always potentially there, but it's now been
exposed more often. It will hit you hardest if you have any of
several combinations of factors leading to a must-allocate-to-write,
must-write-to-free condition, and low memory.
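To make that condition concrete, here is a toy model in Python. It is
not kernel code: the function name and parameters are hypothetical,
invented for illustration, and it assumes each write borrows a fixed
number of pages that are returned when the write completes, at which
point the cleaned page also becomes free.

```python
def pagedaemon_pass(free_pages, dirty_pages, pages_per_write):
    """Toy model of one cleaning pass; returns (free, dirty, stuck).

    Every reclaimable page is dirty, so freeing it requires a write
    (must-write-to-free). Each write needs pages_per_write free pages
    before it can be issued (must-allocate-to-write).
    """
    while dirty_pages > 0:
        if free_pages < pages_per_write:
            # Nothing can be freed without a write, and no write can
            # start without free memory: the pass cannot make progress.
            return free_pages, dirty_pages, True
        # The borrowed pages come back on completion, and the page that
        # was written out is now clean and can be freed.
        dirty_pages -= 1
        free_pages += 1
    return free_pages, dirty_pages, False
```

With pages_per_write set to 0 the pass always drains, which is why a
disk driver normally avoids allocating on the write path. With any
non-zero cost, the outcome depends entirely on how much free memory is
left when the pagedaemon wakes up.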
The new buffercache really allows softdep to shine, and will produce
immense performance benefits for filesystem ops that create and delete
lots of files and directories. Unfortunately, it also exposes a
weakness - there's nothing except the size of buffercache that will
limit the number of writes it will issue in a big burst.
In particular, the problem arises when you've got a buffercache full
of dirty buffers, and also page cache full of dirty file data, such as
a big cvs update or untar of a large source tree. When the pagedaemon
wakes up to free memory, anything it might free needs to be written to
The tuning improvements (including fixing some of the arithmetic/units
errors like you mention) we made recently have improved the memory
shrink/grow behaviour of the new buffercache considerably, with
respect to winding up in one of those conditions - but you can still
Most of the times I see this problem now, it's non-fatal - which is a
*big* improvement over earlier. I have a setup that is particularly
suited to triggering it, because for me, doing those writes involves
allocating more memory. That's normally something you want to avoid
desperately in a disk driver, precisely to avoid this deadlock, and
That aside, for the general user, part of this might be addressed by
further fine-tuning, but some of it needs to be addressed further up
and down the line. Down the line, to remove any similar
allocate-to-write behaviour, and up the line to pace softdep
generating big loads of metadata updates.
I have a tiny hack in my tree that helps a little with that, by
spreading the i/o scheduling of those writes out a little, which is
just enough to demonstrate that more work on this should be
> i thought that the uvm would keep some stuff to be able to allocate
> inodes and always be able to swap out (we get into uvm_pageout,
> buf_drain and buf_trim to get into softdep-disk-io-initiation and
> then uvm_km_kmemalloc1 is called and the call to uvm_wait that
> follows puts us into deadlock since we're already waiting for pages
> to get freed and if the pagedaemon cant even find some by paging
> out.. moo stuck. moo.
One thing I haven't yet had a chance to try is moving the buf_drain
back to the end of the pagedaemon, where it used to be. It was moved
to the front as part of an effort to get the bufcache to respond
better to memory pressure, by asking it to give back memory first.
This worked very well in stopping the buffercache growing too fast in
general use, before the last round of other tuning efforts and
bugfixes did even better - it may no longer be necessary.
But if it's full of dirty pages, asking it to give back too much at
once will worsen the problem, and we're better off doing so once
something else has had a chance to release memory. Beware that doing
so will require a revisit of the tuning values as a result.
There is also another behaviour change under consideration that may be
beneficial in handling the burst of dirty metadata buffers, but it's
all an effort to balance resources rather than a strict recipe, such
as a protocol to avoid lock-based deadlocks.
In the meantime, if you're repeatedly suffering from this, the easiest
thing to do is play with the externally accessible tuning parameters.
In particular, try
sysctl -w vm.bufcache=10
and repeat your test. That should at least make it harder to get into
this state. Let me know what you find.
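For a rough sense of scale, the arithmetic can be sketched in Python.
This assumes vm.bufcache is read as a percentage of physical memory
(check sysctl(7) on your system to confirm); the function name is
hypothetical and the 512 MB figure is only an example, not a
measurement from any machine in this thread.

```python
def bufcache_limit_bytes(physmem_bytes, bufcache_percent):
    """Upper bound on buffer cache size, assuming a percentage knob."""
    return physmem_bytes * bufcache_percent // 100


# Example: a machine with 512 MB of RAM and the setting above.
limit = bufcache_limit_bytes(512 * 1024 * 1024, 10)
print(limit // (1024 * 1024), "MB")
```

A smaller ceiling means fewer dirty buffers can pile up before writes
have to start, which is the point of trying a lower value.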