Subject: Re: yamt-readahead branch
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 11/16/2005 23:10:41
On Wed, Nov 16, 2005 at 12:44:36PM +0900, YAMAMOTO Takashi wrote:
> >  - I don't think that having the read ahead state in struct file
> >    is the way to go, for a couple reasons:
> > 
> > 	- it doesn't handle multi-threaded programs, nor groups of processes
> > 	  that inherit file descriptors from a common parent.
> > 	- it doesn't handle doing read ahead from page-faults.
> > 
> >    I think it would be much better to keep the read ahead state in the
> >    vnode (or rather the genfs_node) and detect multiple read patterns
> >    heuristically.  in the common case, there's only one sequential pattern
> >    so detecting that is really easy.
> 
> i made it per-file because:
> 
> 	- fadvise() is per-file.  (well, it is per-range actually.)

sure, but that's not really relevant.  the read-ahead state is more than
just the advice.


> 	- i think that a file is a good enough approximation of a requester
> 	  in common cases.

well, the most common case is that only one thread reads a file at a time,
and for that we don't need per-struct-file state.  I think the next most
common case is multiple threads reading the file through the same file
descriptor, and per-struct-file state doesn't help there either.  the case
where multiple threads read the file simultaneously through separate file
descriptors seems relatively rare, and if that multiple-fd case is worth
optimizing for, then the shared-fd case is at least as important, so
per-struct-file state isn't very useful overall.
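
to make the per-vnode idea concrete, here's a rough sketch of the kind
of thing I have in mind (this is not code from your branch, and all the
names are invented): a small table of stream slots hanging off the
genfs_node, so a handful of concurrent sequential readers can each be
tracked without any per-struct-file state.

#include <stdint.h>

#define RA_NSTREAMS	4		/* streams tracked per vnode */
#define RA_MINWIN	65536		/* initial window, in bytes */
#define RA_MAXWIN	(16 * 65536)	/* window growth limit */

struct ra_stream {
	uint64_t rs_next;	/* offset we expect the next read at */
	uint32_t rs_winsize;	/* current read-ahead window (bytes) */
};

struct ra_ctx {			/* would hang off the genfs_node */
	struct ra_stream rc_streams[RA_NSTREAMS];
	unsigned int rc_victim;	/* slot recycled when nothing matches */
};

/*
 * called for every read; returns how many bytes are worth scheduling
 * for read-ahead beyond the current request, or 0 if the access
 * doesn't look sequential.
 */
static uint32_t
ra_check(struct ra_ctx *rc, uint64_t off, uint32_t len)
{
	unsigned int i;
	struct ra_stream *rs;

	for (i = 0; i < RA_NSTREAMS; i++) {
		rs = &rc->rc_streams[i];
		if (off == rs->rs_next) {
			/* sequential continuation: advance and grow */
			rs->rs_next = off + len;
			if (rs->rs_winsize < RA_MINWIN)
				rs->rs_winsize = RA_MINWIN;
			else if (rs->rs_winsize < RA_MAXWIN)
				rs->rs_winsize *= 2;
			return rs->rs_winsize;
		}
	}

	/* no stream matched: recycle a slot as a new candidate */
	rc->rc_victim = (rc->rc_victim + 1) % RA_NSTREAMS;
	rs = &rc->rc_streams[rc->rc_victim];
	rs->rs_next = off + len;
	rs->rs_winsize = 0;
	return 0;
}

the point is just that the state lives with the vnode and the heuristic
decides how many patterns there are, not the callers.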


> 	- uvm_fault() has its own read-ahead mechanism.
> 	  and we can easily fall back to per-vnode or per-mapping context
> 	  if desirable.

faulting on mappings should generate the same read-ahead behaviour as read(),
ideally using the same code.  I don't think that adding a read-ahead context
to a vm_map_entry would be good either, since that also doesn't handle
multiple threads in the same process.
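
continuing that sketch, I'd have both paths funnel into one helper, so
a fault on a mapping behaves exactly like a read() at the same offset.
vp_ra_ctx() and ra_start_io() are only placeholders here for "find the
context on the genfs_node" and "queue the asynchronous I/O".

struct vnode;					/* opaque for the sketch */
struct ra_ctx *vp_ra_ctx(struct vnode *);	/* placeholder lookup */
void ra_start_io(struct vnode *, uint64_t, uint32_t); /* placeholder async read */

/*
 * single entry point, called from both the VOP_READ path and the
 * page-fault (getpages) path.
 */
static void
ra_access(struct vnode *vp, uint64_t off, uint32_t len)
{
	struct ra_ctx *rc = vp_ra_ctx(vp);
	uint32_t ahead = ra_check(rc, off, len);

	if (ahead != 0)
		ra_start_io(vp, off + len, ahead);
}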


> 	- building a heuristic which can handle multiple streams is hard. :-)

I agree it's not trivial, but that should make it all the more satisfying.  :-)


> 	  do you have any good ideas?

well, for my day job I implemented a version of this myself several years
ago (and spent forever tweaking it so it didn't cause problems with various
single-threaded patterns), and some other guys later tossed that and wrote
a different version.  however, neither of these is open-source, and to make
it worse, both algorithms are patented, so we'll need to do something
different.  I'll look up the patent numbers so you can see what we need to
avoid.  I have some ideas on other ways we can do this that should avoid
the patents, but I'll let you think about it for a bit separately first;
hopefully you'll come up with some fresh ideas.

for now, I think it would be a good enough improvement to just support the
same single-threaded forward-sequential read-ahead that we've always had,
plus the sliding-window multiple-I/Os-in-flight policy that you already
wrote.  we can treat handling multiple reader threads as a separate task.
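
to spell out what I mean by the sliding window (again only a sketch
with made-up names, not uvm_readahead.c itself): keep a fixed number of
chunk-sized asynchronous reads issued ahead of wherever the application
currently is, and top the window up on every read.

#include <stdint.h>

#define RA_CHUNK	65536	/* bytes per asynchronous read */
#define RA_DEPTH	4	/* chunks kept in flight ahead of the reader */

void ra_start_io_chunk(uint64_t off, uint32_t len);	/* placeholder async read */

struct ra_window {
	uint64_t rw_issued;	/* lowest offset not yet issued */
};

/*
 * called after each application read at cur_off; issues enough
 * chunk-sized reads to keep RA_DEPTH chunks outstanding ahead of it.
 */
static void
ra_slide(struct ra_window *rw, uint64_t cur_off)
{
	uint64_t want = cur_off + (uint64_t)RA_CHUNK * RA_DEPTH;

	if (rw->rw_issued < cur_off)
		rw->rw_issued = cur_off;	/* reader seeked past us: resync */

	while (rw->rw_issued < want) {
		ra_start_io_chunk(rw->rw_issued, RA_CHUNK);
		rw->rw_issued += RA_CHUNK;
	}
}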


> >  - there needs to be some feedback for scaling back the amount of read ahead
> >    that we do in case memory gets low.  otherwise we'll have cases where we
> >    do a bunch of read ahead, but the memory is reclaimed for a different
> >    purpose before the application catches up.
> > 
> >  - I see you have some XXX comments about tuning the amount of data to
> >    read ahead based on physical memory, which is true, but we should also
> >    tune based on the I/O throughput of the underlying device.  we want to
> >    be able to keep any device 100% busy, ideally without the user needing
> >    to configure this manually.  but we'll need some way to allow manual
> >    per-device tuning as well.
> 
> sure.
> they should be on a todo list, but not for this branch.
> 
> sorry for not being clear: finding the best algorithm is not a goal of
> this branch.  uvm_readahead.c is merely a "sample algorithm".
> i think it works well enough for common cases, tho.

doing this in steps is fine; I just wanted to get more of the issues
out on the table.
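
as an illustration of the memory-pressure feedback I mentioned above,
even something this simple would be a start (sketch only; free_pages
and pagesize stand in for whatever the VM system exports):

#include <stdint.h>

/*
 * clamp a requested read-ahead window so that outstanding read-ahead
 * never exceeds a small fraction of free memory.  the 1/16 divisor is
 * arbitrary and would want tuning.
 */
static uint32_t
ra_clamp_window(uint32_t want, uint64_t free_pages, uint32_t pagesize)
{
	uint64_t limit = free_pages * pagesize / 16;

	if ((uint64_t)want > limit)
		return (uint32_t)limit;
	return want;
}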


> or, do you think they can't be done with the current uvm_ra_* api?

the uvm_ra_* stuff is fine; I just don't like keeping the state in
struct file and passing it as an argument to VOP_READ().

-Chuck