Subject: Re: New device buffer queue strategy
To: Chris Jepeway <jepeway@blasted-heath.com>
From: Chuck Silvers <chuq@chuq.com>
List: tech-perform
Date: 09/03/2002 22:05:59
hi,

as you allude to, the i/os sent from the FS/VM layer to the disk driver
will already have clustering done, so there's little to be gained by
doing it again in the disk driver.  the exception to this would be for
layered disk drivers like raidframe, where non-contiguous chunks of i/o
in the virtual device presented to the FS/VM layer can become contiguous
for the underlying real devices.  for such layered disk drivers, this
re-clustering could be useful.  I'd like to see some empirical evidence
that such code helps before it goes into the tree, though.
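
to make the layered case concrete, here's a toy striping map (made-up
names, not raidframe's actual code): with two components and a 64-sector
interleave, virtual sectors 0-63 and 128-191 aren't adjacent in the
virtual device, but they land back-to-back on component 0, so a merge
is possible below the layered driver that the FS/VM layer can't see.

	/* toy 2-disk stripe with a 64-sector interleave -- illustration only */
	#define NCOMP		2
	#define INTERLEAVE	64	/* sectors per stripe unit */

	static void
	stripe_map(daddr_t vblk, int *comp, daddr_t *cblk)
	{
		daddr_t su = vblk / INTERLEAVE;		/* stripe unit number */

		*comp = su % NCOMP;			/* which component disk */
		*cblk = (su / NCOMP) * INTERLEAVE + vblk % INTERLEAVE;
	}

	/* stripe_map(0, ...)   maps to component 0, sector 0  */
	/* stripe_map(128, ...) maps to component 0, sector 64 */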

-Chuck


On Mon, Sep 02, 2002 at 05:35:25PM -0400, Chris Jepeway wrote:
> > Finally, I wonder if it might make sense to attempt to merge adjacent
> > requests up to MAXPHYS at queue insert time.
> I've got some code I wrote for a client that's called as
> 
> 	bp = blk_cluster(sd->buf_queue, sector_size);
> 
> at queue *removal* time.  It gangs together adjacent buffers at the
> front of the queue up to MAXPHYS in length.  It uses uvm_km_kmemalloc(),
> vtophys() and pmap_kenter_pa() to cobble up a new buffer with a b_data
> pointing to VA that maps all the b_data of the individual adjacent
> buffers.  If the buffer at the head of the queue isn't adjacent to
> the second buffer in the queue, blk_cluster() just returns the first buffer.
> 
> The b_iodone for cobbled buffers will biodone() all the buffers it
> gangs together, much like the old cluster_save code that left when UBC
> went in.
> 
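> Recast in terms of the new BUFQ_*() macros, it's roughly the sketch
> below.  The helper names are made up, the uvm/pmap arguments are from
> memory, and it assumes page-aligned, page-multiple buffers, so read it
> as the shape of the thing rather than the code itself:
> 
> 	/*
> 	 * Sketch of the idea, not the real code: gang the adjacent buffers
> 	 * at the head of the queue into one buf whose b_data maps all of
> 	 * their pages.
> 	 */
> 	struct clust {
> 		TAILQ_HEAD(, buf) cl_bufs;	/* the ganged-up buffers */
> 	};
> 
> 	static void blk_cluster_iodone(struct buf *);
> 
> 	struct buf *
> 	blk_cluster(struct bufq_state *bufq, int secsize)
> 	{
> 		struct buf *first, *bp, *cbp;
> 		struct clust *cl;
> 		vaddr_t va, off;
> 		vsize_t len;
> 		int i;
> 
> 		first = BUFQ_GET(bufq);
> 		/* assuming a BUFQ_PEEK() to go with BUFQ_GET() */
> 		if (first == NULL || (bp = BUFQ_PEEK(bufq)) == NULL ||
> 		    bp->b_blkno != first->b_blkno + first->b_bcount / secsize ||
> 		    (bp->b_flags & B_READ) != (first->b_flags & B_READ) ||
> 		    first->b_bcount + bp->b_bcount > MAXPHYS)
> 			return first;	/* empty queue, or head isn't adjacent to #2 */
> 
> 		cl = malloc(sizeof(*cl), M_DEVBUF, M_WAITOK);
> 		TAILQ_INIT(&cl->cl_bufs);
> 		TAILQ_INSERT_TAIL(&cl->cl_bufs, first, b_actq);
> 		len = first->b_bcount;
> 
> 		/* removal time: walk the whole run first, so the total VA
> 		 * needed is known before asking for any of it */
> 		while ((bp = BUFQ_PEEK(bufq)) != NULL &&
> 		    bp->b_blkno == first->b_blkno + (daddr_t)(len / secsize) &&
> 		    (bp->b_flags & B_READ) == (first->b_flags & B_READ) &&
> 		    len + bp->b_bcount <= MAXPHYS) {
> 			(void)BUFQ_GET(bufq);
> 			TAILQ_INSERT_TAIL(&cl->cl_bufs, bp, b_actq);
> 			len += bp->b_bcount;
> 		}
> 
> 		/* one VA allocation, then map everybody's pages into it */
> 		va = uvm_km_kmemalloc(kernel_map, NULL, len, UVM_KMF_VALLOC);
> 		off = 0;
> 		TAILQ_FOREACH(bp, &cl->cl_bufs, b_actq) {
> 			for (i = 0; i < bp->b_bcount; i += PAGE_SIZE)
> 				pmap_kenter_pa(va + off + i,
> 				    vtophys((vaddr_t)bp->b_data + i),
> 				    VM_PROT_READ | VM_PROT_WRITE);
> 			off += bp->b_bcount;
> 		}
> 
> 		/* the cobbled-up buffer that actually goes to the hardware */
> 		cbp = malloc(sizeof(*cbp), M_DEVBUF, M_WAITOK);
> 		memset(cbp, 0, sizeof(*cbp));
> 		cbp->b_data = (caddr_t)va;
> 		cbp->b_bcount = len;
> 		cbp->b_blkno = first->b_blkno;
> 		cbp->b_dev = first->b_dev;
> 		cbp->b_flags = first->b_flags | B_CALL;
> 		cbp->b_iodone = blk_cluster_iodone;
> 		cbp->b_saveaddr = (caddr_t)cl;	/* a la the old cluster_save */
> 		return cbp;
> 	}
> 
> 	static void
> 	blk_cluster_iodone(struct buf *cbp)
> 	{
> 		struct clust *cl = (struct clust *)cbp->b_saveaddr;
> 		struct buf *bp;
> 
> 		/* hand completion (and any error) back to the ganged buffers */
> 		while ((bp = TAILQ_FIRST(&cl->cl_bufs)) != NULL) {
> 			TAILQ_REMOVE(&cl->cl_bufs, bp, b_actq);
> 			if (cbp->b_flags & B_ERROR) {
> 				bp->b_flags |= B_ERROR;
> 				bp->b_error = cbp->b_error;
> 			}
> 			biodone(bp);
> 		}
> 		pmap_kremove((vaddr_t)cbp->b_data, cbp->b_bcount);
> 		uvm_km_free(kernel_map, (vaddr_t)cbp->b_data, cbp->b_bcount);
> 		free(cl, M_DEVBUF);
> 		free(cbp, M_DEVBUF);
> 	}
> 
> Error handling, odd alignments, and failure paths are all left out of
> the sketch.
> 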
> Doing the clustering at removal time instead of at insertion time
> lets you know how much total VA you need and lets you request it
> all at once.  If you cluster at insertion time, you've got to either
> build up your VA incrementally, calling uvm_km_kmemalloc() more than
> once, or just ask for MAXPHYS VA, which could be too greedy, particularly
> if MAXPHYS goes dynamic.
> 
> > Given fixed (and substantial!)
> > command overhead, simply reducing the number of I/O requests might help
> > more than one might think, particularly with request sources such as
> > RAIDframe that are known to produce smaller requests than the disks can
> > handle.
> This client has a proprietary filesystem that issues small-ish requests
> a la RAIDframe.  blk_cluster() measurably improves that f/s's performance.
> I don't have any RAIDframe stats, nor any for FFS, but when the proprietary
> f/s uses clustering, it gets a 4X improvement in throughput on old-ish disks.
> Take the 4X with a grain of salt, since the client's f/s doesn't do any
> up-front clustering the way UBC does.
> 
> They'd like to release the code to the community if there's interest.
> They're in a bit of a crunch at present, though, so it might take me
> a week or perhaps more to get it cleared with them.  Adapting it to
> -current with the new BUFQ_{PUT,GET}() should go quickly once they
> OK the release.
> 
> If folk think there's a chance an interface like blk_cluster() might make
> it into NetBSD proper, I'm willing to do the legwork to get this code
> released for scrutiny by the world at large.  Once it's vetted and if
> it's accepted, I suspect my client would fund some of my time to adapt
> it as necessary for buy-back by the NetBSD group.
> 
> Let me know.
> 
> > Thor
> Chris <jepeway@blasted-heath.com>.