Subject: Re: comparing raid-like filesystems
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-kern
Date: 02/02/2003 19:07:35
On Sun, Feb 02, 2003 at 11:52:00AM -0500, Thor Lancelot Simon wrote:
> > (c) a per-device MAXPHYS
> > 
> > I had patches for this at one time, but it was before UVM and UBC.
> > I have plans to look at this again
> 
> I'm actually working on this right now,

Cool

> but I've hit a bit of a snag.
> 
> As a first step, I've gone through the VM and FS code and replaced
> inappropriate uses of MAXBSIZE with MAXPHYS (there's also a rather
> amusing use of MAXBSIZE to set up a DMA map in the aic driver which I
> found, and there may be a few other things like this I haven't found).

Did you commit this yet?

> 
> Maximum I/O size on pageout (write) is controlled by clustering in
> the pager and by genfs_putpages().  But AFAICT, the only thing that
> causes any "clustering" on reads -- the only thing that ever causes
> us to read more than a single page at a time -- is the readahead in
> genfs_getpages(), which is fixed at a size of 16 pages (MAX_READ_AHEAD,
> also used to initialize genfs_rapages):
> 
> rapages = MIN(MIN(1 << (16 - PAGE_SHIFT), MAX_READ_AHEAD), genfs_rapages);
> 
> AFAICT if a user application does a read() of 8K, 64K is actually read.
> If a user application does a read() of 32K, 64K is actually read.  If
> a user application does a read() of 128K, two 64K page-ins are done into
> the application's "window", each triggered by a single page fault on the
> first relevant page.

If I understand it properly, we can in fact get at most 128k of read-ahead
(because genfs_racount is 2).
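In rough numbers, assuming 4k pages and reading the constants as they are now:

	MAX_READ_AHEAD (16) * PAGE_SIZE (4096)	= 64k per read-ahead range
	genfs_racount ranges in flight		= 2
	maximum outstanding read-ahead		= 2 * 64k = 128k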

> 
> This doesn't seem ideal.  Basically, we discard all information about
> the I/O size the user requested and just blindly read MAX_READ_AHEAD pages
> in a single transfer every time the user touches one that's not in.  So it
> appears that MAXPHYS (or, in the future, its per-device equivalent) doesn't
> control I/O size for reads at all.  And applications that are intelligent
> about their read behaviour are basically penalized for it.

The size of the user request is in the vnode, right? So it's there; we
just need to use it.

> 
> Am I reading this code wrong?  If so, I can't figure out what _else_ limits
> a transfer through the page cache to 64K at present; could someone please
> educate me?

It looks like the MAX_READ_AHEAD limit currently comes from this:
struct vm_page *pg, *pgs[MAX_READ_AHEAD];

So we can probably bump MAX_READ_AHEAD to something higher, at the expense
of more stack usage.
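For reference (my arithmetic, assuming the usual pointer sizes), the array
only holds pointers:

struct vm_page *pgs[MAX_READ_AHEAD];	/* 16 * 4 = 64 bytes on ILP32,
					   16 * 8 = 128 bytes on LP64 */

so going to, say, 256 entries would already cost 1-2k of kernel stack.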
Ideally, the limit should be imposed only by the device drivers, but
dynamically allocating a temporary array in such a critical path is probably
not a good idea. We could allocate it on the stack instead, but I don't know
of any way to allocate a variable-sized array on the stack (I don't think
alloca is allowed in the kernel).  I think we'll have to impose some upper
bound on the per-device MAXPHYS anyway.
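To make that last point concrete, something along these lines (the names are
made up for the sketch, nothing like this exists today):

#define MAXPHYS_MAX	(1024 * 1024)	/* made-up compile-time ceiling */

/* clamp whatever maxphys a driver would like to advertise */
u_int
clamp_dev_maxphys(u_int wanted)
{
	return MIN(wanted, MAXPHYS_MAX);	/* MIN from <sys/param.h> */
}

That way fixed-size things like pgs[] can be dimensioned from MAXPHYS_MAX
while each driver still advertises its own, smaller limit.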

> 
> This is problematic because it means that if we have a device with maxphys
> of, say, 4MB, either we're going to end up reading 4MB every time we fault
> in a single page, or we're never going to issue long transfers to it at all.
> Yuck.
> 
> Obviously we could use physio to get around this restriction.  Without an
> abstraction like "direct I/O" to allow physio-like access to blocks managed
> by the filesystem, however, this won't help many users in the real world.
> 
> Am I misunderstanding this?  If not, does anyone have any suggestions on
> how to fix it?  A naive approach would seem to be to have read() call the
> getpages() routine directly, asking for however many pages the user
> requested, so we weren't relying on a single page fault (and, potentially,
> uvm_fault()'s small lookahead) to bring in a larger chunk.  But we'd still
> need to enforce a maximum I/O size _somewhere_; right now, it seems to
> just be a happy coincidence that we have one at all, because:
> 
> 1) uvm_fault() won't ask for more than a small lookahead (4 pages) at a
>    time
> 
> 2) genfs_getpages() won't read ahead more than 16 pages at a time
> 
> And the larger of those two limits just _happens_ to = MAXPHYS.

I suspect this was not a coincidence :)
Also note that on systems with 8k pages (e.g. alpha), the UBC limit is 128k,
while MAXPHYS is still 64k.
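(In rough numbers: MAX_READ_AHEAD (16) * 8192 bytes = 128k of read-ahead,
i.e. twice MAXPHYS.)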

I'd say we just need to define an upper bound for MAXPHYS, and define
MAX_READ_AHEAD in terms of MAXPHYS. This means MAX_READ_AHEAD will probably
have to be defined in an MD header, as PAGE_SHIFT isn't constant. This could
also be useful for some drivers.
We also need to define a read-ahead size distinct from MAX_READ_AHEAD,
which would be tunable (ideally via a sysctl).
Then we can change genfs_getpages() to read the (tunable) read-ahead size
for small user requests, and min(user_request, device->maxphys,
MAX_READ_AHEAD) for larger ones, that is:
min(max(user_request, min_readahead), device->maxphys, MAX_READ_AHEAD)
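Written out in C (just a sketch, not against the current genfs_getpages();
min_readahead would be the hypothetical tunable, dev_maxphys the per-device
limit, and MAX_READ_AHEAD is converted from pages to bytes):

size_t
read_length(size_t user_request, size_t min_readahead, size_t dev_maxphys)
{
	size_t len;

	len = MAX(user_request, min_readahead);	/* MIN/MAX from <sys/param.h> */
	len = MIN(len, dev_maxphys);
	len = MIN(len, (size_t)MAX_READ_AHEAD << PAGE_SHIFT);
	return len;
}

Small reads still get the tunable read-ahead; large reads are bounded by the
device and by the size of the pgs[] array.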

Anyway, I think we can start working on per-device maxphys even without
changes to genfs_getpages(). I/O that doesn't go through UVM or the
filesystem would still benefit from it: direct access to the character
device, and things like raidframe's parity rebuild or component
reconstruction.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 24 years of experience will always make the difference