Subject: Re: comparing raid-like filesystems
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 02/02/2003 12:50:38
hi,

what you're observing here is that the current genfs read ahead
isn't terribly smart, which is true.  the previous read ahead code
in the now-gone vfs_cluster.c was only slightly smarter, so the
genfs stuff seemed good enough at the time.  it can certainly be
improved, though.  a better way might be to use an adaptive approach
similar to TCP's mechanism for adjusting the window size.
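
to give a rough idea (purely hypothetical, none of these names exist
in the tree; MIN and PAGE_SHIFT are the usual kernel macros): the
window would grow while accesses stay sequential and collapse back
down on a seek, capped by whatever the device can take per i/o.

struct ra_state {
	off_t	ra_next;	/* next offset we expect to see */
	int	ra_winpages;	/* current window in pages, starts at 1 */
};

static void
ra_adjust(struct ra_state *ra, off_t off, int maxwinpages)
{
	if (off == ra->ra_next) {
		/* sequential hit: open the window up to the per-device limit */
		ra->ra_winpages = MIN(2 * ra->ra_winpages, maxwinpages);
	} else {
		/* seek: drop back to a single cluster */
		ra->ra_winpages = 1;
	}
	ra->ra_next = off + ((off_t)ra->ra_winpages << PAGE_SHIFT);
}

maxwinpages would come from the per-device limit once that exists.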

the read ahead size chosen by the genfs code is just to match the
default size of the old read ahead, which was MAXBSIZE, ie. 64k,
which also happened to be MAXPHYS on most platforms.  it's not
a coincidence, but the relationship has gotten obscured by now.

one reason to keep a smallish file system i/o size for now is that
since we have to map pages to do i/o on them, fragmentation in
pager_map could become an issue.  obviously this won't be a problem
once we redo the device i/o interface to work directly on physical
pages instead of requiring that they be mapped.  however, that's
a major project and no one has started on it yet.
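
for concreteness, the mapped-i/o step looks roughly like this (names
from uvm_pager.c, signatures and flags approximate; pgs/npages being
the page array for the transfer).  each transfer holds npages worth
of pager_map kva until it completes, so a few large concurrent i/os
can eat or fragment the map.

vaddr_t kva;

/* map the pages into pager_map so the driver can see them */
kva = uvm_pagermapin(pgs, npages, UVMPAGER_MAPIN_WAITOK);
/* ... point a struct buf at (void *)kva and VOP_STRATEGY() it ... */
/* give the kva back once the i/o completes */
uvm_pagermapout(kva, npages);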

having read() call VOP_GETPAGES() all the time would introduce a lot of
extra overhead for cache hits.  currently if the data to satisfy a read()
is in the page cache, and the UBC mapping for that page is in the mapping
cache, then we never call VOP_GETPAGES() and just access the already-mapped
pages directly.
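
roughly, that fast path shapes up like this (condensed from the ufs
read code; error handling and the exact flags are approximate, and
file_size stands in for the inode's size).  when the pages and the
ubc mapping are already there, nothing below calls VOP_GETPAGES().

while (uio->uio_resid > 0) {
	/* ubc_alloc() trims bytelen down to one ubc window */
	vsize_t bytelen = MIN(file_size - uio->uio_offset, uio->uio_resid);
	void *win = ubc_alloc(&vp->v_uobj, uio->uio_offset, &bytelen, UBC_READ);

	/* copy out of the already-mapped pages */
	error = uiomove(win, bytelen, uio);
	ubc_release(win, 0);
	if (error)
		break;
}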

doing read ahead out of VOP_GETPAGES() is good in that it handles both
read() and faults on mappings.  but it could benefit from the extra info
available in the read() case.
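
one (entirely hypothetical) way to get that info down there would be
to stash the size of the pending read() somewhere getpages can see
it; v_rahint and maxrapages below are made up, and locking is
hand-waved.

/* in the read path, before touching the window: */
vp->v_rahint = howmany(uio->uio_resid, PAGE_SIZE);

/* in genfs_getpages(), when sizing the read ahead cluster: */
rapages = MAX(1, MIN(vp->v_rahint, maxrapages));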

-Chuck


On Sun, Feb 02, 2003 at 11:52:00AM -0500, Thor Lancelot Simon wrote:
> On Sun, Feb 02, 2003 at 03:03:29PM +0100, Manuel Bouyer wrote:
> > On Sun, Feb 02, 2003 at 01:31:25PM +0000, Martin J. Laubach wrote:
> > > |  Sorry, I misremembered. The limit for IDE is 128k, not 64 (and much higher
> > > |  with the extended read/write commands).
> > > 
> > >   So basically we want either
> > > 
> > >   (a) a (much?) larger MAXPHYS
> > > 
> > >   (b) no MAXPHYS at all and let the low level driver deal with
> > >       segmenting large requests if necessary
> > 
> > (c) a per-device MAXPHYS
> > 
> > I had patches for this at one time, but it was before UVM and UBC.
> > I have plans to look at this again.
> 
> I'm actually working on this right now, but I've hit a bit of a snag.
> 
> As a first step, I've gone through the VM and FS code and replaced
> inappropriate uses of MAXBSIZE with MAXPHYS (there's also a rather
> amusing use of MAXBSIZE to set up a DMA map in the aic driver which I
> found, and there may be a few other things like this I haven't found).
> 
> Maximum I/O size on pageout (write) is controlled by clustering in
> the pager and by genfs_putpages().  But AFAICT, the only thing that
> causes any "clustering" on reads -- the only thing that ever causes
> us to read more than a single page at a time -- is the readahead in
> genfs_getpages(), which is fixed at a size of 16 pages (MAX_READ_AHEAD,
> also used to initialize genfs_rapages):
> 
> rapages = MIN(MIN(1 << (16 - PAGE_SHIFT), MAX_READ_AHEAD), genfs_rapages);
> 
> AFAICT if a user application does a read() of 8K, 64K is actually read.
> If a user application does a read() of 32K, 64K is actually read.  If
> a user application does a read() of 128K, two 64K page-ins are done into
> the application's "window", each triggered by a single page fault on the
> first relevant page.
> 
> This doesn't seem ideal.  Basically, we discard all information about
> the I/O size the user requested and just blindly read MAX_READ_AHEAD pages
> in a single transfer every time the user touches one that's not in.  So it
> appears that MAXPHYS (or, in the future, its per-device equivalent) doesn't
> control I/O size for reads at all.  And applications that are intelligent
> about their read behaviour are basically penalized for it.
> 
> Am I reading this code wrong?  If so, I can't figure out what _else_ limits
> a transfer through the page cache to 64K at present; could someone please
> educate me?
> 
> This is problematic because it means that if we have a device with maxphys
> of, say, 4MB, either we're going to end up reading 4MB every time we fault
> in a single page, or we're never going to issue long transfers to it at all.
> Yuck.
> 
> Obviously we could use physio to get around this restriction.  Without an
> abstraction like "direct I/O" to allow physio-like access to blocks in
> files managed by the filesystem, however, this won't help many users in
> the real world.
> 
> Am I misunderstanding this?  If not, does anyone have any suggestions on
> how to fix it?  A naive approach would seem to be to have read() call the
> getpages() routine directly, asking for however many pages the user
> requested, so we weren't relying on a single page fault (and, potentially,
> uvm_fault()'s small lookahead) to bring in a larger chunk.  But we'd still
> need to enforce a maximum I/O size _somewhere_; right now, it seems to
> just be a happy coincidence that we have one at all, because:
> 
> 1) uvm_fault() won't ask for more than a small lookahead (4 pages) at a
>    time
> 
> 2) genfs_getpages() won't read ahead more than 16 pages at a time
> 
> And the larger of those two limits just _happens_ to = MAXPHYS.
> 
> Thor