Subject: Re: comparing raid-like filesystems
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 02/02/2003 11:52:00
On Sun, Feb 02, 2003 at 03:03:29PM +0100, Manuel Bouyer wrote:
> On Sun, Feb 02, 2003 at 01:31:25PM +0000, Martin J. Laubach wrote:
> > | Sorry, I misremembered. The limit for IDE is 128k, not 64 (and much higher
> > | with the extended read/write commands).
> >
> > So basically we want either
> >
> > (a) a (much?) larger MAXPHYS
> >
> > (b) no MAXPHYS at all and let the low level driver deal with
> > segmenting large requests if necessary
>
> (c) a per-device MAXPHYS
>
> I had patches for this at one time, but it was before UVM and UBC.
> I have plans to look at this again.
I'm actually working on this right now, but I've hit a bit of a snag.
As a first step, I've gone through the VM and FS code and replaced
inappropriate uses of MAXBSIZE with MAXPHYS (I also found a rather
amusing use of MAXBSIZE to set up a DMA map in the aic driver, and
there may be a few other things like it that I haven't found).
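For what it's worth, the sort of thing I mean looks roughly like the
fragment below. This is a made-up driver function, not the actual aic
change; the segment count and flags are placeholders.

/*
 * Made-up illustration of the MAXBSIZE/MAXPHYS confusion -- not the
 * actual aic driver change.  MAXBSIZE bounds filesystem block size;
 * a transfer DMA map should instead be sized by the largest physical
 * transfer we'll ever do, i.e. MAXPHYS (or, eventually, its
 * per-device equivalent).
 */
#include <sys/param.h>      /* MAXBSIZE, MAXPHYS */
#include <machine/bus.h>    /* bus_dma(9) */

#define XX_NSEGS    17      /* placeholder segment count */

int
xx_create_xfer_map(bus_dma_tag_t dmat, bus_dmamap_t *mapp)
{

    /* Wrong: sized by the maximum filesystem block size. */
    /*
    return bus_dmamap_create(dmat, MAXBSIZE, XX_NSEGS, MAXBSIZE, 0,
        BUS_DMA_NOWAIT, mapp);
    */

    /* Right: sized by the maximum physical transfer size. */
    return bus_dmamap_create(dmat, MAXPHYS, XX_NSEGS, MAXPHYS, 0,
        BUS_DMA_NOWAIT, mapp);
}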
Maximum I/O size on pageout (write) is controlled by clustering in
the pager and by genfs_putpages(). But AFAICT, the only thing that
causes any "clustering" on reads -- the only thing that ever causes
us to read more than a single page at a time -- is the readahead in
genfs_getpages(), which is fixed at a size of 16 pages (MAX_READ_AHEAD,
also used to initialize genfs_rapages):
rapages = MIN(MIN(1 << (16 - PAGE_SHIFT), MAX_READ_AHEAD), genfs_rapages);
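To make the effect concrete, here is a trivial userland rendering of
that arithmetic. The constants are assumptions of mine (4K pages, and
genfs_rapages left at MAX_READ_AHEAD); this is just the expression
above, not genfs code.

/*
 * Standalone rendering of the readahead-size computation quoted
 * above.  The constants are assumptions (4K pages, genfs_rapages
 * defaulting to MAX_READ_AHEAD); this is not genfs code.
 */
#include <stdio.h>

#define PAGE_SHIFT      12                  /* assume 4K pages */
#define PAGE_SIZE       (1 << PAGE_SHIFT)
#define MAX_READ_AHEAD  16
#define MIN(a, b)       ((a) < (b) ? (a) : (b))

int
main(void)
{
    int genfs_rapages = MAX_READ_AHEAD;     /* assumed default */
    int rapages;

    rapages = MIN(MIN(1 << (16 - PAGE_SHIFT), MAX_READ_AHEAD),
        genfs_rapages);

    /* Note that the user's read() size never enters into it. */
    printf("pages read in per fault: %d (%d bytes)\n",
        rapages, rapages * PAGE_SIZE);
    return 0;
}

With 4K pages that works out to 16 pages, i.e. 64K, regardless of the
read() size.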
AFAICT if a user application does a read() of 8K, 64K is actually read.
If a user application does a read() of 32K, 64K is actually read. If
a user application does a read() of 128K, two 64K page-ins are done into
the application's "window", each triggered by a single page fault on the
first relevant page.
This doesn't seem ideal. Basically, we discard all information about
the I/O size the user requested and just blindly read MAX_READ_AHEAD pages
in a single transfer every time the user touches a page that isn't
resident. So it appears that MAXPHYS (or, in the future, its per-device
equivalent) doesn't control I/O size for reads at all, and applications
that are intelligent about their read behaviour are effectively penalized
for it.
Am I reading this code wrong? If so, I can't figure out what _else_ limits
a transfer through the page cache to 64K at present; could someone please
educate me?
This is problematic because it means that if we have a device with a maxphys
of, say, 4MB, either we're going to end up reading 4MB every time we fault
in a single page, or we're never going to issue long transfers to it at all.
Yuck.
Obviously we could use physio to get around this restriction. Without an
abstraction like "direct I/O" to allow physio-like access to blocks in
files managed by the filesystem, however, this won't help many users in
the real world.
Am I misunderstanding this? If not, does anyone have any suggestions on
how to fix it? A naive approach would be to have read() call the
getpages() routine directly, asking for however many pages the user
requested, so that we aren't relying on a single page fault (and,
potentially, uvm_fault()'s small lookahead) to bring in a larger chunk;
a rough sketch of that idea follows the list below. But we'd still need
to enforce a maximum I/O size _somewhere_; right now, it seems to just
be a happy coincidence that we have one at all, because:
1) uvm_fault() won't ask for more than a small lookahead (4 pages) at a
time
2) genfs_getpages() won't read ahead more than 16 pages at a time
And the larger of those two limits (16 pages, i.e. 64K with 4K pages) just
_happens_ to equal MAXPHYS.
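Here's the rough sketch I promised above -- a userland toy of the
"clamp to a per-device limit" idea. dev_maxphys() and read_in() are
stand-ins of mine, not existing interfaces, and none of this is kernel
code.

/*
 * Userland sketch of the "naive approach" above: size transfers by
 * the user's request, clamped to a per-device maximum transfer size,
 * instead of relying on fault-time readahead.  Everything here is a
 * stand-in (dev_maxphys(), read_in()); none of it exists in the tree.
 */
#include <stdio.h>
#include <stddef.h>

#define MIN(a, b)   ((a) < (b) ? (a) : (b))

/* Stand-in for a per-device MAXPHYS fetched from the driver. */
static size_t
dev_maxphys(void)
{
    return 4 * 1024 * 1024;     /* imagine a device happy with 4MB */
}

/* Stand-in for "page in 'len' bytes at 'off' via getpages". */
static void
read_in(long long off, size_t len)
{
    printf("  transfer: offset %lld, %zu bytes\n", off, len);
}

/* Issue transfers sized by the request, never exceeding the limit. */
static void
clamped_read(long long off, size_t resid)
{
    size_t chunk, maxxfer = dev_maxphys();

    printf("read() of %zu bytes:\n", resid);
    while (resid > 0) {
        chunk = MIN(resid, maxxfer);
        read_in(off, chunk);
        off += chunk;
        resid -= chunk;
    }
}

int
main(void)
{
    clamped_read(0, 32 * 1024);         /* one 32K transfer */
    clamped_read(0, 6 * 1024 * 1024);   /* 4MB + 2MB transfers */
    return 0;
}

The point is just that the transfer size would be driven by the request
and bounded by the device limit, rather than by a fixed readahead
constant.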
Thor