tech-kern archive


Re: [PATCH] bufq_priocscan enhancement



On Tue, Jun 14, 2011 at 01:37:16AM +0200, Marek Dopiera wrote:
> On Wednesday 08 June 2011 08:11:37 Thor Lancelot Simon wrote:
> > On Tue, Jun 07, 2011 at 10:49:10PM +0200, Marek Dopiera wrote:
> > > Moreover I think (but I have not checked that in any way) that we can
> > > reduce the time spent in interrupts by merging disk requests in the
> > > scheduler or increasing buffer cache page size (which may be beneficial
> > > on 4KB sector drives).
> >
> > Marek,
> >
> > Several of us have looked at this and even implemented it in hackish ways
> > for testing purposes.  There are some issues.  One is that it's tough to
> > cleanly merge requests without copying and to cleanly handle errors (the
> > "nestiobuf" abstraction looks like it could be used for this but errors
> > are basically not handled at all by that code).
> 
> It's good to know that someone already knows the problems. Do you happen to
> know the branch names, or have some patches elsewhere, so that I could take a
> look? I am aware (maybe not fully) that it's hard; however, the Linux folks
> manage it somehow (I don't know how yet), so I'd like to at least gain
> confidence that it is impossible on NetBSD, or, better, try to solve it.

I don't have any patches, no.  You might ask Eric Haszlakiewicz.  Also,
the xbdback driver does something along these lines so it's not constantly
submitting small requests to the I/O subsystem.  Jed Davis did that.
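For what it's worth, here is the rough shape of the merge test a bufq strategy
would have to make.  This is a sketch only: can_merge() is an invented helper,
not an existing NetBSD interface, and it deliberately skips the hard parts
(copying/bouncing, and above all splitting the completion and any error back
out to the original buffers, which is exactly what nobody handles cleanly):

    /*
     * Hypothetical helper, sketch only: decide whether two queued
     * transfers could be handed to the driver as a single request.
     */
    #include <sys/param.h>
    #include <sys/buf.h>

    static bool
    can_merge(const struct buf *a, const struct buf *b, int maxphys)
    {
            if ((a->b_flags & B_READ) != (b->b_flags & B_READ))
                    return false;           /* same direction only */
            if (a->b_rawblkno + btodb(a->b_bcount) != b->b_rawblkno)
                    return false;           /* must be contiguous on disk */
            return a->b_bcount + b->b_bcount <= maxphys;
    }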

> Still, MAXPHYS is mostly 64KB, which is 128 times the buffer cache page size,
> so I think it's worth the effort even so. If I'm correct, Linux uses 128KB, so
> not much more, and still has lower CPU overhead than we do.

Well, no.  Most of the I/O that matters goes through the page cache, not
the metadata cache.  And the page cache clusters I/O to MAXPHYS already --
albeit rather poorly, particularly for write.

But the problem is when you have devices that multiplex.  Put a RAID with
8 data disks on top of your wd, and instead of submitting 1/2 as much at
a time as you could (64K when the device could do 128K) suddenly you are
submitting 1/16 as much as you could (8K when the device could do 128K).
I am pretty sure Linux manages to avoid losing in this way and can cluster
I/O to an appropriate size on metadevices or LVM (in this case, for large
writes and IDE disks underneath, the appropriate size would be 8 * 128K or
1MB).
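
In case the arithmetic isn't obvious, a throwaway userland calculation of the
numbers above (toy model of striping, not RAIDframe):

    #include <stdio.h>

    int
    main(void)
    {
            const int maxphys = 64 * 1024;          /* what the RAID is handed */
            const int ndata = 8;                    /* data members in the set */
            const int member_max = 128 * 1024;      /* what each wd could take */

            printf("per-member transfer: %dK (device could do %dK)\n",
                maxphys / ndata / 1024, member_max / 1024);
            printf("request size needed to fill every member: %dK\n",
                ndata * member_max / 1024);
            return 0;
    }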

If you implement I/O request merging you're going to run into this much
more often.

To fix this requires making "MAXPHYS" a property of the device -- but to
do that correctly means propagating it up and down the device tree so
buses can limit the max request size for devices attached to them (consider
all the things an 'sd' can attach to and you will see why this is so).
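
Concretely, something like the following, with invented names ('devnode' and
its 'maxphys' field are stand-ins, not the real struct device): each node in
the device tree carries its own limit, and the effective limit for a leaf is
the minimum along its attachment path, so a bus with a small DMA limit
automatically constrains every sd hanging off it.

    /* Sketch only: effective transfer limit = minimum along the path. */
    struct devnode {
            struct devnode *parent;         /* bus/controller we attach to */
            unsigned        maxphys;        /* this layer's limit, 0 = none */
    };

    static unsigned
    effective_maxphys(const struct devnode *dv)
    {
            unsigned limit = ~0u;           /* "unlimited" until clamped */

            for (; dv != NULL; dv = dv->parent)
                    if (dv->maxphys != 0 && dv->maxphys < limit)
                            limit = dv->maxphys;
            return limit;
    }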

Thor

