Subject: Re: state of BUFQ_PRIOCSCAN
To: Sumantra Kundu <>
From: Daniel Carosone <>
List: tech-kern
Date: 09/21/2006 09:21:46

On Wed, Sep 20, 2006 at 09:50:10AM -0500, Sumantra Kundu wrote:
> To join in the discussion, let me point out that as part of google
> summer of code 2006, we initiated a project (mentor: Bill  Studenmund)
> to implement congestion control inside the uvm so that in the presence
> of multiple readers and writers, disk access gets skewed towards the
> reader processes.
> The algo works well till the number of free pages drops to a critical
> low level and the page daemon gets fired. This causes the reader
> processes to stall since there are no pages to read the data into.

As I've said before, I'm a little skeptical of the premise that
favouring readers is automatically the right thing.  I'm also quite
sure your excellent work will be the best way to answer the question
with measurement rather than speculation, so please, keep it up.

But something about your wording above made me see this issue from
another angle as well; one that (in a small way) reinforces my doubts.

Clearly, we need backpressure on the processes that are dirtying
buffers faster than the IO system can retire them, and that is the
biggest part of the problem - but by favouring readers at the bottom
end, we're also contributing to the problem by not cleaning those
buffers sooner.  What seems clear from your results so far is that
favouring reads without also applying that backpressure results very
quickly in buffer starvation.

I strongly suspect that the best result depends critically on knowing
just *which* reads to prioritise, and being very selective about
those.  The classic example would be a metadata read that would allow
the filesystem to then issue a whole bunch of writes currently filling
uvm with dirty pages.  That's essentially what PRIOCSCAN attempts to
do, but it needs more information to really be effective.
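To make the per-class idea concrete, here is a minimal user-space
sketch (not the actual BUFQ_PRIOCSCAN code; the class names and
structures are invented for illustration): requests are queued by
priority class, and the scheduler always serves the highest non-empty
class, so a metadata read jumps ahead of queued bulk writes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical priority classes, highest priority first. */
enum bq_class { BQ_METADATA = 0, BQ_READ = 1, BQ_WRITE = 2, BQ_NCLASS = 3 };

struct bq_req {
	struct bq_req *next;
	enum bq_class class;
};

struct bq {
	struct bq_req *head[BQ_NCLASS];
	struct bq_req *tail[BQ_NCLASS];
};

/* Append a request to the FIFO for its class. */
static void
bq_put(struct bq *q, struct bq_req *r)
{
	r->next = NULL;
	if (q->tail[r->class] != NULL)
		q->tail[r->class]->next = r;
	else
		q->head[r->class] = r;
	q->tail[r->class] = r;
}

/* Serve the highest-priority non-empty class first. */
static struct bq_req *
bq_get(struct bq *q)
{
	for (int c = 0; c < BQ_NCLASS; c++) {
		struct bq_req *r = q->head[c];
		if (r != NULL) {
			q->head[c] = r->next;
			if (q->head[c] == NULL)
				q->tail[c] = NULL;
			return r;
		}
	}
	return NULL;
}
```

A scheduler this naive would starve the write class under a steady
read load; the real implementation bounds how long it stays in one
class before moving on, which is exactly where the "more information"
about which reads matter would pay off.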

Getting those dirty pages into writes in the bufq, and then issued and
completed (or at least into the disk write cache) sooner, would in
turn allow some bulk reads to have a place to land.  We've spoken
about the need for better signalling from the IO system to UVM to
detect congestion, but we also need better signalling downwards, to
enable better scheduling of the work presently at hand and to respond
first to the requests that will make the biggest overall difference.

For reference:
This argues for prioritising reads, but just as important is that the
zfs scheduler also prioritises within the set of reads, using
deadlines to communicate more detailed priorities to the disk layer.
(There are deadline writes in there too, which can free resources or
dependencies by completing fs transactions.)
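Loosely sketched (field and function names invented, not the zfs
code), the deadline scheme amounts to: every I/O carries a deadline,
the queue issues the earliest deadline first, and the upper layers
express urgency by how near a deadline they assign - reads and
transaction-commit writes get near deadlines, bulk async writes far
ones.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative request: reads and fs-transaction writes would be
 * assigned near deadlines, bulk async writes distant ones. */
struct dl_req {
	long deadline;	/* tick by which this I/O should issue */
	int  is_write;
};

/* Pick the request with the earlier deadline; on a tie, favour the
 * read, matching the overall read preference. */
static const struct dl_req *
dl_pick(const struct dl_req *a, const struct dl_req *b)
{
	if (a->deadline != b->deadline)
		return (a->deadline < b->deadline) ? a : b;
	return a->is_write ? b : a;
}
```

The point is that "read vs write" collapses into one ordering key, so
a dependency-freeing write can still overtake an unimportant read.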

As for what we could try in NetBSD in the shorter term, this isn't the
first time I've wondered about an arbitrary/tunable limit in UVM on
the total of (dirty pages + pages locked for reads).  The rest of the
pages would be available for clean cache.  Processes pushing uvm over
the limit would be blocked until dirty pages are cleaned or read pages
are filled as IO requests complete - even if those requests have not
yet been made.

This constrained resource could model overall IO capacity for "in
flight" data, both current and upcoming, in a way that can apply
smoothing pressure on future demand.  By excluding metadata from the
resource limit, it works at the level of process demand and doesn't
penalise these reads upon which future IO requests may depend.
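A toy model of that budget (names invented; in the kernel the failing
case would sleep and completions would issue wakeups): one counter
covers dirty pages plus pages locked for in-flight reads, demand that
would push it past the cap blocks, and completions release it.
Metadata I/O simply never charges the budget.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented names: a single budget covering dirty pages plus pages
 * locked for in-flight reads.  Metadata is deliberately exempt. */
struct io_budget {
	unsigned long inflight;	/* dirty + read-locked pages */
	unsigned long limit;	/* the tunable cap */
};

/* A process dirtying pages (or locking pages for a read) charges the
 * budget; if the cap would be exceeded, the caller must block.  Here
 * we just report whether the charge succeeds. */
static bool
budget_charge(struct io_budget *b, unsigned long npages)
{
	if (b->inflight + npages > b->limit)
		return false;		/* caller would sleep here */
	b->inflight += npages;
	return true;
}

/* Called when a write completes (page cleaned) or a read fills its
 * pages; in the kernel this would also wake blocked processes. */
static void
budget_release(struct io_budget *b, unsigned long npages)
{
	b->inflight -= npages;
}
```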

It's by no means a great model of IO capacity, and I don't really like
the idea of 'solving' one bottleneck by introducing another, but it
might well be a useful experiment or interim step.  Sizing and tuning
the resource limit, especially where the IO system involves several
independent disks, would be challenging - but at least it would be a
knob whose effects could be systematically studied.

