Subject: Re: yamt-readahead branch
To: YAMAMOTO Takashi <>
From: Chuck Silvers <>
List: tech-kern
Date: 11/15/2005 19:02:53

thanks for taking this on, I'm glad someone got around to it.
I have some comments about the design:

 - I don't think that having the read ahead state in struct file
   is the way to go, for a couple reasons:

	- it doesn't handle multi-threaded programs, nor groups of processes
	  that inherit file descriptors from a common parent.
	- it doesn't handle doing read ahead from page-faults.

   I think it would be much better to keep the read ahead state in the
   vnode (or rather the genfs_node) and detect multiple read patterns
   heuristically.  in the common case, there's only one sequential pattern
   so detecting that is really easy.

 - as for the read ahead policy, we should have a sliding-window kind of
   scheme, such that we can keep multiple disk I/Os pending in the disk driver
   all the time (or at least until the application takes a breather).
   ie. we shouldn't wait for the application to copy out all the data that
   we've read ahead before starting more I/Os.  (actually on second look,
   I think you already do this, but it's hard to tell since you didn't
   describe your algorithm at all.)

 - there needs to be some feedback for scaling back the amount of read ahead
   that we do in case memory gets low.  otherwise we'll have cases where we
   do a bunch of read ahead, but the memory is reclaimed for a different
   purpose before the application catches up.

 - I see you have some XXX comments about tuning the amount of data to
   read ahead based on physical memory, which is true, but we should also
   tune based on the I/O throughput of the underlying device.  we want to
   be able to keep any device 100% busy, ideally without the user needing
   to configure this manually.  but we'll need some way to allow manual
   per-device tuning as well.

and some comments on the implementation:

 - if this is going to be part of UVM, then we should use the MADV_* constants
   rather than the POSIX_FADVISE_* ones.  or the other way around, but we
   we should use the same ones for both read() and reading from mappings.
 - please use a DPRINTF kind of macro instead of having a bunch of
   "if defined(READAHEAD_DEBUG)".  the former is much easier on the eyes.

 - please add some comments to the code describing how it's supposed to work.


On Tue, Nov 15, 2005 at 04:32:30PM +0900, YAMAMOTO Takashi wrote:
> hi,
> i made some read-ahead related changes on the branch.
> unless any serious problem is found, i'll merge it to trunk shortly.
> so please don't hesitate to review the changes.
> summary:
> 	- switch from contextless-readahead.
> 	- isolate read-ahead code so that you can implement
> 	  and evalutate other algorithms more easily.
> 	- implement posix_fadvise.
> todo:
> 	- test posix_fadvise.
> 	- posix_fadvise manpage.  can take from TOG one?
> 	- remove old genfs read-ahead code.
> 	- update vop_read_args comments for each filesystems.
> a quick benchmark (bonnie++ -n0) is attached.
> *1 and *2 are the same kernel.  i did the same test twice.
> i'm wondering why "Rewrite" value has been changed.
> YAMAMOTO Takashi