Subject: Re: IO Congestion Control
To: Sumantra Kundu <sumantra@gmail.com>
From: Steven M. Bellovin <smb@cs.columbia.edu>
List: tech-kern
Date: 09/11/2006 20:56:24
On Mon, 11 Sep 2006 19:36:25 -0500, "Sumantra Kundu" <sumantra@gmail.com>
wrote:

> >> L_(i+1) = k*L_i + (1-k)*L
> 
> As Bill remarked, it is difficult to understand how you could map
> network congestion control algos to our case.  What is shown here is
> the EWMA technique, which is not an algorithm but a mechanism to
> smooth out transient spikes.
> 
> Congestion control comes into the picture only after we know that
> congestion has occurred.  In TCP this is identified by lost packets.
> And as we noticed, IO congestion is affected by three main factors:
> (i) the ratio of dirty vs. free pages, (ii) the effective service
> time of the writes, and (iii) the nature of the writes.  Note: all
> these approaches are generic and are not tied to any specific
> technology.  What really changes are the queuing and performance
> delay figures, which are just inputs to our framework.
> 
Your original post spoke of

	a mechanism that is able to capture and understand the
	"capabilities", "limitations", and "performance" of such a device
	at run time and make such performance figures available to the
	UVM, before any sort of device directed IO throttling could be
	initiated.

I'm suggesting this equation as an alternative to your proposed
mechanism.  (Apart from anything else, write latency depends on higher
layers as well as on the device.  A write request that stays within the
cylinder or cylinder group, per the allocation algorithms (and yes, I
know their limitations as applied to modern hardware), will be much less
costly than a reference to a new (physical) cylinder, which would require
a costly seek.)
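
In code, that mechanism is just a one-line update per completed write.
A minimal C sketch (the structure, the names, and the fixed-point
choice of k = 7/8 are mine for illustration; nothing here is in the
tree):

	#include <stdint.h>

	/* Running EWMA of write service time for one device:        */
	/* L_(i+1) = k*L_i + (1-k)*L, with k = 7/8, in microseconds. */
	struct dev_lat {
		uint64_t smoothed_us;		/* L_i, the running estimate */
	};

	static void
	dev_lat_update(struct dev_lat *dl, uint64_t sample_us)
	{
		if (dl->smoothed_us == 0)	/* first sample seeds it */
			dl->smoothed_us = sample_us;
		else
			dl->smoothed_us =
			    (7 * dl->smoothed_us + sample_us) / 8;
	}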

As for the question you ask here -- as always, my instinct is to try the
simplest thing first.  Again, look at that section of RFC 793; after
computing the smoothed round-trip time, TCP multiplies it by some factor
and decides that a retransmission is needed when that figure is exceeded.
So -- calculate an analogous figure for disk latency; when a write request
exceeds that amount, penalize the requesting process (or, for that matter,
the next process doing a write operation on that device -- statistically,
it will be about the same, and the latter algorithm is even better because
it penalizes processes that force other processes' dirty pages to be
flushed.)
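
The decision itself would look something like the following (the factor
of 2, the function name, and how exactly one "penalizes" a process are
placeholders I'm making up for the sake of the example, not a worked-out
design):

	#include <stdint.h>

	#define LAT_FACTOR	2	/* analogue of TCP's RTO multiplier */

	/* A write that took longer than LAT_FACTOR times the smoothed */
	/* figure is treated as a sign of congestion; the caller would */
	/* then delay the requesting process (or, per the above, the   */
	/* next process that writes to the device).                    */
	static int
	write_congested(uint64_t smoothed_us, uint64_t this_write_us)
	{
		return this_write_us > LAT_FACTOR * smoothed_us;
	}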

		--Steven M. Bellovin, http://www.cs.columbia.edu/~smb