Subject: IO Congestion Control
To: Bill Studenmund <wrstuden@netbsd.org>
From: Sumantra Kundu <sumantra@gmail.com>
List: tech-kern
Date: 09/11/2006 10:32:56
Hi,

Recent efforts to improve the performance of NetBSD by penalizing
processes that initiate a "lot" of writes (a.k.a. page flushes)
initially worked by making such processes sleep for a fixed amount of
time. The idea was to penalize writes so that processes initiating
reads experienced lower IO service times.

However, while evaluating the performance of the above approach, it
was discovered that such IO throttling fails if there are no free
pages available for the processes to read data into. As a result,
overall system performance was not noticeably improved. The results
behind these observations were recorded in the
thread: http://mail-index.netbsd.org/tech-kern/2006/08/21/0020.html

Taking a cue from the above observation, we now intend to implement a
congestion control algorithm (uvm_cca) inside UVM. However, instead of
observing process behaviour, we intend to "infer congestion" by
observing the dynamics of dirty pages with respect to a specific IO
device.

Since no two IO devices are the same, we need a mechanism that can
capture and understand the "capabilities", "limitations", and
"performance" of such a device at run time and make those performance
figures available to UVM before any sort of device-directed IO
throttling can be initiated. On top of that, writes do not all cost
the same; the cost of a write can generally be thought of as a
function of the disk seek time.

In a nutshell, uvm_cca aims to induce IO throttling by taking into
account: (i) the dynamics of dirty pages inside UVM, (ii) IO device
capability and performance, and (iii) the cost of the disk seeks
associated with each write request.
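
As a rough illustration of how these three inputs might combine, here
is a minimal sketch in C. Everything in it is hypothetical: struct
cca_dev, cca_write_cost_us(), cca_congested(), and the drain-time
budget are names and policy made up for this example, not existing UVM
interfaces, and the per-device figures are assumed to be measured
elsewhere at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device figures, assumed to be measured at run time. */
struct cca_dev {
	uint64_t bytes_per_sec;	/* (ii) sustained sequential write rate */
	uint64_t seek_us;	/* (iii) estimated cost of one seek, in us */
};

/*
 * (iii) Estimated cost of one write in microseconds: transfer time,
 * plus one seek if the write is not contiguous with the previous one.
 */
static uint64_t
cca_write_cost_us(const struct cca_dev *dev, uint64_t nbytes, bool needs_seek)
{
	assert(dev->bytes_per_sec > 0);
	uint64_t xfer_us = nbytes * 1000000ULL / dev->bytes_per_sec;
	return xfer_us + (needs_seek ? dev->seek_us : 0);
}

/*
 * (i) + (ii) Infer congestion for a device: the dirty backlog bound for
 * it would take longer than budget_us to drain at the measured rate.
 * The multiplication is not overflow-checked; this is only a sketch.
 */
static bool
cca_congested(const struct cca_dev *dev, uint64_t dirty_bytes,
    uint64_t budget_us)
{
	assert(dev->bytes_per_sec > 0);
	uint64_t drain_us = dirty_bytes * 1000000ULL / dev->bytes_per_sec;
	return drain_us > budget_us;
}
```

With a device measured at 15 MiB/sec and a 4 ms seek estimate, a 64 KiB
write that needs a seek costs roughly 8.2 ms in this model, about half
of it seek time.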

To highlight a scenario where uvm_cca might be useful, let me take a
leaf out of Bill's email:

"Consider a hard disk that can write about 15 MiB/sec (2^20 MB, not 10^6
MB). Now consider this disk being the destination of a video capture
stream. Non-high-definition TV needs less than 4 MB/sec, so the disk can
capture the video easily, as long as we keep it flowing.

Now say we have a video capture app going, and other things end up writing
to the disk. Some of this may be metadata, some of this may be other
programs. The problem is if we have to seek much. Seek times are rated at
between 6 and 9 ms. Let's say 4 ms for now, to under-estimate. 4ms
represents almost 63k worth of data. So something that means we have to
seek somewhere other than the video stream then seek back puts an 8 ms
delay in our write, which is 126k or so. So if we allow too many little
writes in, we can mess ourselves up."
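
The arithmetic in Bill's example can be checked with a small C sketch.
The helper names are hypothetical, and the second function assumes the
4 MB/sec stream is also counted in 2^20-byte units and ignores the
transfer time of the small writes themselves, so its result is only an
upper bound.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/*
 * Bytes the disk could have written sequentially during a stall of
 * stall_us microseconds at bytes_per_sec (hypothetical helper).
 */
static uint64_t
stall_cost_bytes(uint64_t bytes_per_sec, uint64_t stall_us)
{
	return bytes_per_sec * stall_us / 1000000ULL;
}

/*
 * Upper bound on seek round trips per second that the disk can absorb
 * while still keeping up with a stream of stream_bps bytes/sec.
 */
static uint64_t
max_seek_round_trips(uint64_t disk_bps, uint64_t stream_bps,
    uint64_t round_trip_us)
{
	uint64_t cost = stall_cost_bytes(disk_bps, round_trip_us);

	assert(disk_bps >= stream_bps);
	assert(cost > 0);
	return (disk_bps - stream_bps) / cost;
}
```

For the figures quoted above, stall_cost_bytes(15 * MIB, 4000) gives
62914 bytes (the "almost 63k") and a stall of 8000 us gives 125829
bytes (the "126k or so"). Under the stated assumptions,
max_seek_round_trips(15 * MIB, 4 * MIB, 8000) gives 91, so the disk
can absorb at most about 91 such seek round trips per second before
the capture stream falls behind.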


Comments, feedback, and directions are appreciated.

Thanks,
Sumantra