Subject: Re: Limiting disk I/O?
To: Steven M. Bellovin <smb@cs.columbia.edu>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 11/12/2007 20:12:37
On Tue, Nov 13, 2007 at 12:18:17AM +0000, Steven M. Bellovin wrote:
> On Mon, 12 Nov 2007 20:38:10 +0000 (UTC)
> mlelstv@serpens.de (Michael van Elst) wrote:
> > 
> > The problem comes from running out of buffers. You want to sync
> > often and you want to sync fewer buffers in a burst to mitigate
> > the problem.
> 
> This, I think, is a large part of the actual solution to the problem.
> The other part is to have the process scheduler penalize processes -- or
> users? -- that do too much I/O.

Well, yes.  But even just the first part would suffice.  The basic problem
with softdep, and with the "smooth" syncer we got with it, is that for each
of the few types of metadata I/O x, it schedules all the I/O enqueued during
a given second to be synced out in the second xdelay seconds in the future.
Worse, it tracks metadata I/O to be done in the future by vnode, and it
attaches most metadata I/O to the mount point's own vnode, so it all gets
crammed together.  And there's no way to track processes that cause
metadata I/O, so you can't really penalize them.
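
To make the burstiness concrete, here is a toy userland model of that
scheduling pattern -- the type names and delay values are made up, not the
actual softdep code: every update of type x queued during one busy second
lands in a single burst xdelay seconds later.

/*
 * Toy model (hypothetical names and delays, not softdep itself):
 * all metadata I/O of type x queued during second 0 is scheduled
 * to flush at second delay[x], so a busy second's worth of updates
 * arrives at the disk as one burst per type.
 */
#include <stdio.h>

#define NTYPES 3
static const int delay[NTYPES] = { 2, 4, 6 };	/* assumed per-type delays */

int
main(void)
{
	int burst[16] = { 0 };
	int t, x;

	/* Pretend 100 updates of each type were queued during second 0. */
	for (x = 0; x < NTYPES; x++)
		burst[delay[x]] += 100;		/* everything lands at one instant */

	for (t = 0; t < 8; t++)
		printf("second %d: %d metadata writes due\n", t, burst[t]);
	return 0;
}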

As a first step:

To really fix this, I think you need to have some idea how many tps your
disk subsystem can soak up (this will never be perfect, but you can get
a lot closer than we do now just by observing the maximum and perhaps
leaky-bucketing it), and then you need to enqueue I/O into each of the
seconds from now out to some maximum point in the future; if you've
enqueued I/O that far out, you really _do_ start blocking processes.
That is, treat each second (or 1/10 second, or an even smaller quantum)
as a bucket to be filled up with transactions, probably to no more than
half the maximum tps you've seen in the past, and start filling from now
(or from some initial delay value meant to catch I/O that becomes
redundant) into the future.  And we _have to_ get all the metadata I/O
off the mount point's vnode, or find some other way than per-vnode to
track pending I/O.
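
Here is a minimal userland sketch of that bucketing scheme.  The names and
numbers (HORIZON, INITIAL_DELAY, the decay rate) are assumptions standing
in for whatever the real kernel code would measure; it's only meant to show
the shape of the idea, not to be a drop-in patch.

/*
 * Sketch: one bucket of transaction slots per future second, capped at
 * half the best tps ever observed.  New I/O goes into the first second
 * with room; if every bucket out to the horizon is full, the caller
 * should block.
 */
#include <stdio.h>

#define HORIZON       30	/* how many seconds ahead we will queue */
#define INITIAL_DELAY  1	/* catch I/O that becomes redundant */

static unsigned max_tps_seen = 200;	/* assumed; updated from observation */
static unsigned bucket[HORIZON];	/* transactions queued per second */

/* Record a newly observed per-second completion rate (crude leaky bucket). */
static void
observe_tps(unsigned tps)
{
	if (tps > max_tps_seen)
		max_tps_seen = tps;
	else
		max_tps_seen -= max_tps_seen / 64;	/* slow decay */
}

/*
 * Try to queue one transaction.  Returns the offset (in seconds from now)
 * where it was placed, or -1 if the caller should block.
 */
static int
queue_transaction(void)
{
	unsigned cap = max_tps_seen / 2;	/* fill to half the observed max */
	int s;

	for (s = INITIAL_DELAY; s < HORIZON; s++) {
		if (bucket[s] < cap) {
			bucket[s]++;
			return s;
		}
	}
	return -1;	/* horizon full: throttle the producer */
}

int
main(void)
{
	int i, slot;

	observe_tps(250);
	for (i = 0; i < 5000; i++) {
		slot = queue_transaction();
		if (slot == -1) {
			printf("transaction %d would block\n", i);
			break;
		}
	}
	printf("max tps seen %u, cap %u\n", max_tps_seen, max_tps_seen / 2);
	return 0;
}

The point of the sketch is the key property: a producer only blocks once
every bucket out to the horizon is at its cap, so bursts get spread across
future seconds instead of piling up on one.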

I think this would be much better than what we have now.  But others who
certainly know the literature in this area much better than I do have
proposed other solutions and worked hard to implement them, and those
solutions haven't sufficed, or have even made things worse -- so take all
of this with a large grain of salt.

Thor