Subject: System performance during write-heavy workload: need *flow control*
To: tech-kern@netbsd.org
From: Bill Sommerfeld <sommerfeld@orchard.arlington.ma.us>
List: tech-kern
Date: 01/28/2001 11:24:25
This message may raise more questions than it answers..

Folks discussing the UBC performance problems have largely been
focused on its effect of stealing already-allocated pages from
active applications, as well as on efforts to speed up the cleaning
of certain types of pages.

While this may help with certain workloads, let's think about the

	dd if=/dev/zero of=bigfile count=infinity

case for a while.  

Letting this work well without trashing system performance is
fundamentally a flow-control problem -- rather than speeding up how
quickly pages are recycled, you need to slow down how quickly you let
"dd" dirty new UBC pages..

Let's consider a hypothetical system:

	- 1GB main memory
	- 10MB/s disk write bandwidth.	
	- 100MB/s memory write bandwidth (writing to newly allocated pages,
		including cost of allocation and vm system overhead).

In the long run, it makes no sense to allocate more than 10MB/s of new
pages for the "dd" process to fill, because the disk can't consume
pages any faster than that.  However, "dd" can dirty pages ten times
faster than the disk can clean them, and so it very quickly develops a
backlog which will take many seconds to clear.  I don't think it makes
sense to allow this when there are other processes in the system which
could make forward progress if they had those pages.
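
To put numbers on it: if "dd" runs unthrottled for ten seconds, it
dirties the entire 1GB of main memory while the disk retires only
100MB of it, leaving roughly 900MB of dirty pages which take another
ninety seconds to drain at 10MB/s.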

We'd be far better off if we only allowed "dd" to absorb a few
seconds' worth of pages at most..
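
(With the numbers above, a two-second window comes to 20MB of dirty
pages per writer -- about five thousand pages, assuming 4k pages.)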

Consider the case of running the following workloads simultaneously:

	- dd if=/dev/zero of=bigfile count=infinity
	- a kernel build
	- interactive usage.

In the long run, the "dd" process is limited by the rate at which the
disk can absorb writes; however, in the short term, it will consume
UBC pages as fast as it's allowed to.

Each compile process will be largely compute-bound; it will allocate a
few thousand anon pages, read a few hundred pages (largely header
files which are likely to be the same from compile to compile), and
write a few pages' worth of data to the filesystem.

Interactive usage steals cycles here and there and occasionally
allocates a bunch of new pages for short-running jobs.

It's also worth noting that the dd count=infinity case is very
similar to a process doing the same thing to anonymous memory:

	/* dirty one new anon page per pass */
	for (;;) { char *p = sbrk(PAGE_SIZE); p[0] = 1; }

The problem isn't so much competition between file pages and anon
pages as competition between "dirty pages containing write-only data"
and "pages we are likely to look at again".

Clearly, you want the system to be able to absorb bursts of writes
without needlessly blocking the process creating them.  But you also
don't want to completely drain the free page list *and* evict pages
containing useful data which are likely to be needed again soon if all
you're doing is creating a huge backlog of dirty pages at a rate which
the disk won't be able to absorb..

TCP prevents a single connection with a high-rate source and a
low-rate sink from eating the sending machine by using flow-control
windows... once a connection has more than some number of bytes
buffered, future writes block until the receiver has acked the data
and opened up more window space.

In order to survive the "big writer" workloads, something akin to this
needs to be added to UBC/UVM; the real trick is figuring out the
appropriate place and scope for the "window" accounting needed to
figure out when to apply back-pressure to writers.
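
The accounting itself could be quite simple.  Here's a rough sketch of
what I have in mind -- the names are invented and locking is omitted;
none of this exists in UVM today.  A writer blocks once a window's
worth of dirty pages is outstanding, and write completion reopens the
window, much as an ack does for TCP:

	/*
	 * One window per "scope" (see below), with dw_limit sized to
	 * a few seconds of disk write bandwidth.
	 */
	struct dirty_window {
		int	dw_ndirty;	/* dirty pages outstanding */
		int	dw_limit;	/* a few seconds of disk writes */
	};

	/* Called before a process dirties a new page in this scope. */
	void
	dirty_window_acquire(struct dirty_window *dw)
	{
		while (dw->dw_ndirty >= dw->dw_limit)
			(void) tsleep(dw, PVM, "dirtywin", 0);
		dw->dw_ndirty++;
	}

	/* Called from write completion, once a page has been cleaned. */
	void
	dirty_window_release(struct dirty_window *dw)
	{
		if (dw->dw_ndirty-- == dw->dw_limit)
			wakeup(dw);
	}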

Now the big question is: at what scope do we impose a flow-control
check when going to allocate a new page backed by something in that
scope?

	- controller
	- spindle 

These would likely let us estimate I/O bandwidth more closely,
preventing one I/O-heavy job from saturating the device and starving
others; however, this doesn't help the very common case of
single-controller, single-spindle systems.  Also, it's unlikely that
you could usefully push the information through a subsystem like
RAIDframe..

	- partition/filesystem

This is less accurate; on single-spindle systems where you have just
root & swap, it really only handles anons vs. filesystem contention..

	- vm object for vnode pages; pmap for anons

This might be the right scope, but it's also pretty far removed from
the hardware..
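
At the vm-object scope, for instance, the check would sit wherever a
clean page of the object is about to be dirtied -- again just a
sketch; struct uvm_object has no such field today:

	/* Hypothetical hook: one window hung off each uvm_object. */
	void
	uvm_obj_dirty_throttle(struct uvm_object *uobj)
	{
		dirty_window_acquire(&uobj->uo_dirtywin);
	}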

					- Bill