Subject: Re: comparing raid-like filesystems
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 02/02/2003 11:52:00
On Sun, Feb 02, 2003 at 03:03:29PM +0100, Manuel Bouyer wrote:
> On Sun, Feb 02, 2003 at 01:31:25PM +0000, Martin J. Laubach wrote:
> > |  Sorry, I misremembered. The limit for IDE is 128k, not 64 (and much higher
> > |  with the extended read/write commands).
> > 
> >   So basically we want either
> > 
> >   (a) a (much?) larger MAXPHYS
> > 
> >   (b) no MAXPHYS at all and let the low level driver deal with
> >       segmenting large requests if necessary
> 
> (c) a per-device MAXPHYS
> 
> I had patches for this at one time, but it was before UVM and UBC.
> I have plans to look at this again

I'm actually working on this right now, but I've hit a bit of a snag.

As a first step, I've gone through the VM and FS code and replaced
inappropriate uses of MAXBSIZE with MAXPHYS.  (I also found a rather
amusing use of MAXBSIZE to set up a DMA map in the aic driver, and there
may be a few other things like this I haven't found yet.)
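
The change is of this flavor -- schematic only, not the actual aic code,
and the softc field names here are made up.  The point is that a driver's
data-transfer DMA maps should be sized by the largest transfer the hardware
will actually see (MAXPHYS), not by the filesystem's maximum block size:

	error = bus_dmamap_create(sc->sc_dmat,
	    MAXPHYS,			/* was MAXBSIZE */
	    MAXPHYS / PAGE_SIZE + 1,	/* worst-case segment count */
	    MAXPHYS,			/* maximum segment size */
	    0,				/* no boundary constraint */
	    BUS_DMA_NOWAIT | BUS_DMA_ALLOCNOW,
	    &sc->sc_dmamap);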

Maximum I/O size on pageout (write) is controlled by clustering in
the pager and by genfs_putpages().  But AFAICT, the only thing that
causes any "clustering" on reads -- the only thing that ever causes
us to read more than a single page at a time -- is the readahead in
genfs_getpages(), which is fixed at a size of 16 pages (MAX_READ_AHEAD,
also used to initialize genfs_rapages):

rapages = MIN(MIN(1 << (16 - PAGE_SHIFT), MAX_READ_AHEAD), genfs_rapages);
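
Spelling out the arithmetic (assuming the usual 4K page size; a larger
page size only shrinks the first term):

	/*
	 * With PAGE_SHIFT == 12 (4K pages):
	 *	1 << (16 - PAGE_SHIFT)	== 16 pages == 64K
	 *	MAX_READ_AHEAD		== 16 pages == 64K
	 *	genfs_rapages		== 16 by default
	 * so rapages always works out to 16 pages (64K), regardless of
	 * how much the application actually asked to read.
	 */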

AFAICT if a user application does a read() of 8K, 64K is actually read.
If a user application does a read() of 32K, 64K is actually read.  If
a user application does a read() of 128K, two 64K page-ins are done into
the application's "window", each triggered by a single page fault on the
first relevant page.

This doesn't seem ideal.  Basically, we discard all information about
the I/O size the user requested and just blindly read MAX_READ_AHEAD pages
in a single transfer every time the user touches a page that isn't resident.
So it appears that MAXPHYS (or, in the future, its per-device equivalent)
doesn't control I/O size for reads at all.  And applications that are
intelligent about their read behaviour are basically penalized for it.

Am I reading this code wrong?  If so, I can't figure out what _else_ limits
a transfer through the page cache to 64K at present; could someone please
educate me?

This is problematic because it means that if we have a device with a
maxphys of, say, 4MB, either we're going to end up reading 4MB every time
we fault in a single page (with 4K pages, that's a 1024-page transfer for
a one-page fault), or we're never going to issue long transfers to it at
all.  Yuck.

Obviously we could use physio to get around this restriction.  Without an
abstraction like "direct I/O" to allow physio-like access to blocks in
files managed by the filesystem, however, this won't help many users in
the real world.
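
(Today the only way a user gets the physio path at all is by reading the
raw character device directly, which bypasses the page cache entirely and
is limited only by the driver's minphys.  Something like the sketch below;
the device name and transfer size are just examples, and you need the
appropriate permissions on the raw device:)

#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	char *buf;
	ssize_t n;
	int fd;

	if ((fd = open("/dev/rsd0d", O_RDONLY)) == -1)	/* raw device */
		err(1, "open");
	if ((buf = malloc(1024 * 1024)) == NULL)
		err(1, "malloc");
	/* one 1MB request; physio splits it up per the driver's minphys */
	n = read(fd, buf, 1024 * 1024);
	if (n == -1)
		err(1, "read");
	free(buf);
	close(fd);
	return 0;
}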

Am I misunderstanding this?  If not, does anyone have any suggestions on
how to fix it?  A naive approach would seem to be to have read() call the
getpages() routine directly, asking for however many pages the user
requested, so we wouldn't be relying on a single page fault (and,
potentially, uvm_fault()'s small lookahead) to bring in a larger chunk;
a very rough sketch of that idea follows the two points below.  But we'd
still need to enforce a maximum I/O size _somewhere_; right now, it seems
to be just a happy coincidence that we have one at all, because:

1) uvm_fault() won't ask for more than a small lookahead (4 pages) at a
   time

2) genfs_getpages() won't read ahead more than 16 pages at a time

And the larger of those two limits just _happens_ to equal MAXPHYS
(16 pages of 4K = 64K).
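
For concreteness, here's the sort of thing I mean by the naive approach --
emphatically a sketch, not working code.  "vp_maxphys" stands in for
whatever per-device limit we end up with, the VOP_GETPAGES argument order
is from memory, and the real thing needs locking, error handling, and a
loop over the whole uio:

	struct vm_page *pgs[16];	/* really wants to scale with the per-device limit */
	vsize_t len;
	int npages, error;

	len = MIN(uio->uio_resid, vp_maxphys);	/* cap at the device's limit */
	npages = howmany(len, PAGE_SIZE);	/* exactly what the caller asked for */
	error = VOP_GETPAGES(vp, trunc_page(uio->uio_offset), pgs, &npages,
	    0, VM_PROT_READ, 0, PGO_SYNCIO);
	/* ... then uiomove() the data out of the returned pages ... */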

Thor