tech-kern archive


Re: [PATCH] bufq_priocscan enhancement



On Tue, Jun 14, 2011 at 01:37:16AM +0200, Marek Dopiera wrote:
> On Wednesday 08 June 2011 08:11:37 Thor Lancelot Simon wrote:
> > On Tue, Jun 07, 2011 at 10:49:10PM +0200, Marek Dopiera wrote:
> > > Moreover I think (but I have not checked that in any way) that we can
> > > reduce the time spent in interrupts by merging disk requests in the
> > > scheduler or increasing buffer cache page size (which may be beneficial
> > > on 4KB sector drives).
> >
> > Marek,
> >
> > Several of us have looked at this and even implemented it in hackish ways
> > for testing purposes.  There are some issues.  One is that it's tough to
> > cleanly merge requests without copying and to cleanly handle errors (the
> > "nestiobuf" abstraction looks like it could be used for this but errors
> > are basically not handled at all by that code).
> 
> It's good to know that someone already knows the problems. Do you happen to
> know the branch names, or have some patches elsewhere, so that I could take a
> look? I am aware (maybe not fully) that it's hard; however, the Linux folks
> manage it somehow (I don't know how yet), so I'd like to at least gain
> confidence that it is impossible on NetBSD, or, better, try to solve it.

I don't have any patches, no.  You might ask Eric Haszlakiewicz.  Also,
the xbdback driver does something along these lines so it's not constantly
submitting small requests to the I/O subsystem.  Jed Davis did that.
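For what it's worth, here is the rough shape of the merge test a bufq strategy
would have to make.  This is a sketch only: can_merge() is an invented helper,
not an existing NetBSD interface, and it deliberately skips the hard parts
(copying/bouncing, and above all splitting the completion and any error back
out to the original buffers, which is exactly what nobody handles cleanly):

    /*
     * Hypothetical helper, sketch only: decide whether two queued
     * transfers could be handed to the driver as a single request.
     */
    #include <sys/param.h>
    #include <sys/buf.h>

    static bool
    can_merge(const struct buf *a, const struct buf *b, int maxphys)
    {
            if ((a->b_flags & B_READ) != (b->b_flags & B_READ))
                    return false;           /* same direction only */
            if (a->b_rawblkno + btodb(a->b_bcount) != b->b_rawblkno)
                    return false;           /* must be contiguous on disk */
            return a->b_bcount + b->b_bcount <= maxphys;
    }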

> Still, MAXPHYS is mostly 64KB, which is 128 times the buffer cache page size,
> so I think it's worth the effort even so. If I'm correct, Linux uses 128KB, so
> not much more, and still has lower CPU overhead than we do.

Well, no.  Most of the I/O that matters goes through the page cache, not
the metadata cache.  And the page cache clusters I/O to MAXPHYS already --
albeit rather poorly, particularly for write.

But the problem is when you have devices that multiplex.  Put a RAID with
8 data disks on top of your wd, and instead of submitting 1/2 as much at
a time as you could (64K when the device could do 128K) suddenly you are
submitting 1/16 as much as you could (8K when the device could do 128K).
I am pretty sure Linux manages to avoid losing in this way and can cluster
I/O to an appropriate size on metadevices or LVM (in this case, for large
writes and IDE disks underneath, the appropriate size would be 8 * 128K or
1MB).
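
In case the arithmetic isn't obvious, a throwaway userland calculation of the
numbers above (toy model of striping, not RAIDframe):

    #include <stdio.h>

    int
    main(void)
    {
            const int maxphys = 64 * 1024;          /* what the RAID is handed */
            const int ndata = 8;                    /* data members in the set */
            const int member_max = 128 * 1024;      /* what each wd could take */

            printf("per-member transfer: %dK (device could do %dK)\n",
                maxphys / ndata / 1024, member_max / 1024);
            printf("request size needed to fill every member: %dK\n",
                ndata * member_max / 1024);
            return 0;
    }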

If you implement I/O request merging you're going to run into this much
more often.

To fix this requires making "MAXPHYS" a property of the device -- but to
do that correctly means propagating it up and down the device tree so
buses can limit the max request size for devices attached to them (consider
all the things an 'sd' can attach to and you will see why this is so).
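
Concretely, something like the following, with invented names ('devnode' and
its 'maxphys' field are stand-ins, not the real struct device): each node in
the device tree carries its own limit, and the effective limit for a leaf is
the minimum along its attachment path, so a bus with a small DMA limit
automatically constrains every sd hanging off it.

    /* Sketch only: effective transfer limit = minimum along the path. */
    struct devnode {
            struct devnode *parent;         /* bus/controller we attach to */
            unsigned        maxphys;        /* this layer's limit, 0 = none */
    };

    static unsigned
    effective_maxphys(const struct devnode *dv)
    {
            unsigned limit = ~0u;           /* "unlimited" until clamped */

            for (; dv != NULL; dv = dv->parent)
                    if (dv->maxphys != 0 && dv->maxphys < limit)
                            limit = dv->maxphys;
            return limit;
    }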

Thor

