Subject: Re: UBC status
To: Neil A. Carson <neil@causality.com>
From: Eduardo E. Horvath <eeh@one-o.com>
List: tech-kern
Date: 09/25/1999 14:10:21
On Sat, 25 Sep 1999, Neil A. Carson wrote:

> Chuck Silvers wrote:
> 
> > yea, I'm not very excited about a limit on cached file data either,
> > but many people have talked about such a thing so I listed it tentatively.
> > I was including limiting dirty pages under "pagedaemon optimizations"...
> > could you elaborate on the extremely clever ways this could be avoided?

[Description of pageout issues deleted]

There are two different issues here: handling dirty pages and allocating
clean pages for new buffers.  

There are many solutions for problems caused by dirty buffers and most are
not that complex.

> FreeBSD works around this by having a small limit on the amount of dirty
> data despite allowing the cache to grow. This works very well in
> practice, although I don't really believe this to be the solution
> either, since all the buffer cache junk in there still has the 'blow out
> in one go' problem (although by default you don't notice it).

Does FreeBSD have a limit on page allocations?

> I think the real rules you need to play by would be something like:
> 	- Always keep the IO subsystem active as regards spooling
> 	  dirty data.

I always thought that trying to run the pagescanner at a very low rate in
the idle loop would be a good idea.  Since the system is idle you're not
stealing CPU cycles from something more important.  However, the CPU may
be idle because it's waiting for I/O, and the last thing you want to do to
a system that's already thrashing under a high I/O load is to add some
more.
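
Roughly what I have in mind, as an untested sketch; cpu_is_idle(),
io_queue_depth() and scan_a_few_pages() are made-up placeholders for
whatever the kernel would really provide, not existing interfaces:

#include <stdbool.h>

/* Made-up helpers -- stand-ins for whatever the kernel really exposes. */
extern bool cpu_is_idle(void);		/* nothing runnable right now */
extern int  io_queue_depth(void);	/* outstanding disk requests */
extern void scan_a_few_pages(int n);	/* run the pagescanner over n pages */

#define IDLE_SCAN_BATCH		8	/* keep the idle-loop work tiny */
#define IO_BUSY_THRESHOLD	4	/* back off if the disks are busy */

/*
 * Called from the idle loop: trickle the pagescanner along only when
 * the CPU is idle *and* the disks are not already saturated, so a
 * system that is thrashing under I/O load never gets more page-outs
 * piled on top of it.
 */
void
idle_trickle_scan(void)
{
	if (!cpu_is_idle())
		return;
	if (io_queue_depth() >= IO_BUSY_THRESHOLD)
		return;			/* idle because of I/O wait */
	scan_a_few_pages(IDLE_SCAN_BATCH);
}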

> 	- Implement an IO prioritisation scheme (with some
> 	  heuristics based on drive head location etc) which places
> 	  interactive operations over trickle page-outs

Interesting, but rather complicated since it requires extremely good
sharing of information between the disk and HBA drivers and the
pagedaemon.
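
Just to show the basic shape (and ignoring the head-position
heuristics, which are the hard part), the minimum is something like a
two-class dispatch queue; struct ioreq and the prioq_* names below are
invented for illustration:

#include <stddef.h>
#include <sys/queue.h>

/* Two request classes: interactive I/O always dispatches ahead of
 * trickle page-outs. */
enum ioprio { IOPRIO_INTERACTIVE, IOPRIO_PAGEOUT };

struct ioreq {
	TAILQ_ENTRY(ioreq) link;
	enum ioprio prio;
	/* ... block number, buffer pointer, etc. ... */
};

TAILQ_HEAD(ioq, ioreq);

struct prioq {
	struct ioq interactive;
	struct ioq pageout;
};

void
prioq_init(struct prioq *q)
{
	TAILQ_INIT(&q->interactive);
	TAILQ_INIT(&q->pageout);
}

/* Queue a request on the list for its class. */
void
prioq_enqueue(struct prioq *q, struct ioreq *r)
{
	if (r->prio == IOPRIO_INTERACTIVE)
		TAILQ_INSERT_TAIL(&q->interactive, r, link);
	else
		TAILQ_INSERT_TAIL(&q->pageout, r, link);
}

/* Dispatch: drain interactive requests first; page-outs go out only
 * when nothing interactive is waiting. */
struct ioreq *
prioq_next(struct prioq *q)
{
	struct ioreq *r;

	if ((r = TAILQ_FIRST(&q->interactive)) != NULL)
		TAILQ_REMOVE(&q->interactive, r, link);
	else if ((r = TAILQ_FIRST(&q->pageout)) != NULL)
		TAILQ_REMOVE(&q->pageout, r, link);
	return r;
}

Within each class you would still want the existing disksort-style
ordering, and that is exactly where the information sharing between
the disk/HBA drivers and the pagedaemon gets complicated.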

> 	- If the amount of dirty data starts to accumulate too
> 	  much (ie the IO subsystems are continually saturated)
> 	  then stop it growing further.

Solaris has a nice solution to this problem.  Traditionally, the update
daemon would run every 30 seconds to flush all dirty buffers to disk.
When machines started to have 64MB, 256MB, or more of RAM, that meant
dumping possibly hundreds of MB of data to the disk all at the same
time.  The solution was to run it every 10 seconds over 1/5 of the RAM
on the system, leading to a much more even I/O load.
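
As a rough illustration of that scheme (untested; flush_page_range()
stands in for the real pageout path):

#include <stddef.h>

#define SYNC_FRACTIONS	5	/* flush 1/5 of the page space per pass */

/* Made-up helper: write back any dirty pages in [start, end). */
extern void flush_page_range(size_t start, size_t end);

/*
 * Called from the periodic sync daemon, e.g. every 10 seconds.  Each
 * pass covers the next 1/SYNC_FRACTIONS of the page space, so memory
 * is written back in several small passes instead of one big burst
 * every 30 seconds.
 */
void
incremental_sync(size_t total_pages)
{
	static unsigned int pass;
	size_t chunk = total_pages / SYNC_FRACTIONS;
	size_t start = (pass % SYNC_FRACTIONS) * chunk;
	size_t end = (pass % SYNC_FRACTIONS == SYNC_FRACTIONS - 1) ?
	    total_pages : start + chunk;

	flush_page_range(start, end);
	pass++;
}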

The problem with this was that the buffers were tracked by inode, and
sync used to operate over inodes, so they needed to rewrite it to
operate on pages instead.  The result was an increase in CPU usage
during scanning.

A similar solution could be designed here that scans through some
fraction of the active inodes in the system.  Alternatively, when a
dirty page is created the associated inode could be timestamped and
flushed once it has been dirty for 30 seconds.
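
For the timestamp variant, something like this (again untested; struct
xinode and flush_inode() are just stand-ins for the real thing):

#include <time.h>
#include <sys/queue.h>

#define DIRTY_MAX_AGE	30	/* seconds a dirty inode may sit unflushed */

struct xinode {				/* minimal stand-in for an inode */
	LIST_ENTRY(xinode) dirty_link;
	time_t dirtied_at;		/* 0 while the inode is clean */
};

LIST_HEAD(dirtylist, xinode);

/* Made-up helper: write the inode's dirty pages back to disk. */
extern void flush_inode(struct xinode *ip);

/* Record the time the inode first acquired a dirty page. */
void
inode_mark_dirty(struct dirtylist *dl, struct xinode *ip)
{
	if (ip->dirtied_at == 0) {
		ip->dirtied_at = time(NULL);
		LIST_INSERT_HEAD(dl, ip, dirty_link);
	}
}

/* Periodic scan: flush only the inodes whose dirty data is old enough. */
void
flush_aged_inodes(struct dirtylist *dl)
{
	struct xinode *ip, *next;
	time_t now = time(NULL);

	for (ip = LIST_FIRST(dl); ip != NULL; ip = next) {
		next = LIST_NEXT(ip, dirty_link);
		if (now - ip->dirtied_at >= DIRTY_MAX_AGE) {
			LIST_REMOVE(ip, dirty_link);
			ip->dirtied_at = 0;
			flush_inode(ip);
		}
	}
}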

> 
> In this way, I guess, you effectively have an 'adaptive limit' on the
> amount of dirty data.
> 
> Does this make sense?

Yes.

Now on the more interesting side, what to do about page allocation.  

The high water mark solution is well tested but seems rather arbitrary.
You would want different settings depending on system RAM, current load,
the types of jobs running, etc.

Wiring down pages just because they are shared by a lot of processes does
not seem like such a good idea if those pages are used very seldom.

I have long been speculating about whether we could maintain a separate
set of active and inactive memory lists just for buffer cache pages,
scanned at a faster rate than the current ones.  That would allow
faster re-use of buffer pages without requiring a hard high-water mark.
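
Roughly what I am picturing, as an untested sketch (the structures and
deactivate_or_reclaim() are invented for illustration, and the scan
batch numbers are arbitrary):

#include <sys/queue.h>

struct xpage {				/* minimal stand-in for a page */
	TAILQ_ENTRY(xpage) pageq;
	/* ... the rest of the page state ... */
};

TAILQ_HEAD(pagequeue, xpage);

/* One active/inactive pair for anonymous pages and a second pair just
 * for buffer cache (file) pages, which is scanned at a higher rate. */
struct pagequeues {
	struct pagequeue anon_active, anon_inactive;
	struct pagequeue file_active, file_inactive;
};

#define ANON_SCAN_BATCH		32
#define FILE_SCAN_BATCH		128	/* buffer pages scanned ~4x faster */

/* Made-up helper: scan up to nscan pages off the active queue,
 * deactivating or reclaiming them as usual. */
extern void deactivate_or_reclaim(struct pagequeue *act,
    struct pagequeue *inact, int nscan);

/*
 * One pagedaemon pass: the file-page queues get a larger scan quota,
 * so buffer cache pages are recycled sooner without needing a hard
 * high-water mark on the cache size.
 */
void
pagedaemon_pass(struct pagequeues *pq)
{
	deactivate_or_reclaim(&pq->anon_active, &pq->anon_inactive,
	    ANON_SCAN_BATCH);
	deactivate_or_reclaim(&pq->file_active, &pq->file_inactive,
	    FILE_SCAN_BATCH);
}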

=========================================================================
Eduardo Horvath				eeh@one-o.com
	"I need to find a pithy new quote." -- me