Subject: Re: I/O priorities
To: Greywolf <greywolf@starwolf.com>
From: Steven J. Dovich <dovich@tiac.net>
List: tech-kern
Date: 06/21/2002 01:59:38
I think it's time to back up to first principles, and at the same
time dredge up some old BSD history from another area.

Greywolf writes:
> An I/O scheduler, would that be something like the process scheduler
> where the more contiguous time a write process has, the lower its
> priority gets on the I/O end?

I would argue that I/O scheduling is not the same as process
scheduling, and that appropriate solutions should come from another
direction. Here is my line of reasoning (and the appeal to historical
BSD).

The underlying problem is that of apportioning each consumer a
"fair" share of some resource.  For process scheduling, that resource
is the processor across time. For I/O, the resource is device/bus
bandwidth. BSD has had decent bandwidth management facilities for
well over a decade.  Van Jacobson and Mike Karels made the fundamental
advance in dealing with detecting and responding to network congestion
by identifying algorithms which converged on the fair-share allocation
of network bandwidth.

What we are looking at here is I/O congestion.  With a means of
determining the available bandwidth, detecting channel saturation,
and a means of throttling the requestors, there is no reason why
TCP congestion avoidance algorithms can't contribute to effective
management of I/O bandwidth.

The real challenge is mapping those techniques onto the I/O problem.
We don't have virtual circuits with window sizes to manipulate. But
I am certain that a little clever engineering can provide the
needed hooks. The main hurdle as I see it is adapting the method
in such a way as to be able to throttle down to fair-share under
congestion, and keep the algorithm cost low enough to not be a
problem when not in a congested state.

To speak more directly to Greywolf's query, I/O scheduling needs
to limit the arrival rate from processes that exceed their fair
share of the available device/bus bandwidth, for some suitable
value of "fair" (as amended by priority/administrative policy).
If we simply try to order the incoming requests to address the
problem, I fear that all that happens is that we move the problem
to another workload pattern. The only real solution is to quench
the arrival rate at the source, and tie that to adequate metrics
of available bandwidth and channel consumption. Using write time
as a surrogate for those metrics is not sufficiently accurate or
effective.

OK, enough abstraction.  Back to the practical problem of how
to implement this with real, portable code...

/sjd