Subject: Re: Disk scheduling policy (Re: NEW_BUFQ_STRATEGY)
To: None <>
From: Jason Thorpe <>
List: tech-kern
Date: 12/01/2003 16:14:54

On Dec 1, 2003, at 3:57 PM, Thor Lancelot Simon wrote:

> The way I read the SGI text, they do N requests from queue A, then
> N requests from queue B, and so forth.  A simple implementation of
> this seems like it might disrupt the elevator sort quite badly, so I
> wonder if they actually did something more clever.

They probably didn't have to do anything more clever.  SGI systems 
almost exclusively used SCSI disks (tuned a certain way), and could 
thus rely on the disk to mitigate any disruption of the elevator sort 
(through command reordering).
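
For concreteness, here is a minimal sketch of the naive "N from queue A, 
then N from queue B" policy described in the quoted text.  It is not 
SGI's code and not NetBSD's bufq code; struct rq, rr_dispatch() and the 
burst parameter are names made up for illustration, and each queue is 
assumed to already be in elevator (ascending block) order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state: a sorted array of block numbers. */
struct rq {
	const uint64_t *blk;	/* requests, ascending block order */
	size_t len;		/* number of requests in blk[] */
	size_t head;		/* index of the next request to issue */
};

/*
 * Take up to `burst` requests from each queue in turn until all
 * queues are empty or `out` is full.  Returns the number issued.
 */
static size_t
rr_dispatch(struct rq *qs, size_t nq, size_t burst,
    uint64_t *out, size_t outcap)
{
	size_t n = 0;
	int progress = 1;

	while (progress && n < outcap) {
		progress = 0;
		for (size_t i = 0; i < nq && n < outcap; i++) {
			for (size_t k = 0; k < burst &&
			    qs[i].head < qs[i].len && n < outcap; k++) {
				out[n++] = qs[i].blk[qs[i].head++];
				progress = 1;
			}
		}
	}
	return n;
}
```

With two sorted queues and a burst of 2, the issue order jumps backward 
each time the dispatcher switches queues, which is the disruption of 
the elevator sort in question; a disk that reorders commands 
internally can hide much of it.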

Really, it's not clear that the elevator sort buys you much anyway, 
when you're talking to raw disks, because disks don't really expose 
their real geometry anymore.

That said, elevator sort could potentially be VERY useful on RAID 
systems.  I know of a RAID card vendor whose firmware sets a timer when 
it receives a write request for a block within a given stripe, and then 
buffers the write ("stalls" it from the OS's perspective).  If, before 
the timer expires, writes that fill out the rest of the stripe are 
received, the firmware skips the r/m/w cycle for the stripe.  This can 
greatly improve performance.
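
The firmware behavior described above can be modeled roughly as 
follows.  This is only a sketch under stated assumptions: the vendor's 
actual firmware is unknown to the reader, and struct stripe_buf, 
stripe_write(), stripe_tick() and the tick-based timer are all 
hypothetical.  It also assumes at most 32 data blocks per stripe and 
that stripe_tick() is called whenever the timer may have expired.

```c
#include <assert.h>
#include <stdint.h>

enum stripe_action { STRIPE_STALL, STRIPE_FULL_WRITE, STRIPE_RMW };

/* Hypothetical per-stripe buffering state. */
struct stripe_buf {
	uint32_t pending;	/* bit i set: block i is buffered */
	uint32_t full_mask;	/* value of pending when stripe is full */
	uint64_t deadline;	/* tick at which the timer expires */
	int armed;		/* nonzero while the timer is running */
};

static void
stripe_init(struct stripe_buf *sb, unsigned nblocks)
{
	assert(nblocks >= 1 && nblocks <= 32);
	sb->pending = 0;
	sb->full_mask = (nblocks == 32) ?
	    UINT32_MAX : ((UINT32_C(1) << nblocks) - 1);
	sb->deadline = 0;
	sb->armed = 0;
}

/* Buffer a write to block `blk` of the stripe at time `now`. */
static enum stripe_action
stripe_write(struct stripe_buf *sb, unsigned blk, uint64_t now,
    uint64_t timeout)
{
	assert(blk < 32 && ((sb->full_mask >> blk) & 1));
	if (!sb->armed) {
		/* First write to this stripe starts the timer. */
		sb->armed = 1;
		sb->deadline = now + timeout;
	}
	sb->pending |= UINT32_C(1) << blk;
	if (sb->pending == sb->full_mask) {
		/* Whole stripe present: no read/modify/write needed. */
		sb->pending = 0;
		sb->armed = 0;
		return STRIPE_FULL_WRITE;
	}
	return STRIPE_STALL;
}

/* Timer check: an expired partial stripe falls back to r/m/w. */
static enum stripe_action
stripe_tick(struct stripe_buf *sb, uint64_t now)
{
	if (sb->armed && now >= sb->deadline) {
		sb->pending = 0;
		sb->armed = 0;
		return STRIPE_RMW;
	}
	return STRIPE_STALL;
}
```

In this model a write is only "stalled" for as long as its stripe is 
incomplete and the timer has not fired, so the payoff depends entirely 
on the remaining blocks of the stripe arriving within the timeout.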

This, of course, makes this card look horrible if you use dd(1) to test 
RAID-5 write performance, since each of the writes from the dd program 
is issued in lock-step.  However, if the disk queue sorting algorithm 
can arrange to group writes for a stripe together (not even necessarily 
in sequential order), then you can potentially have a major positive 
impact on overall system performance.
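
One way to picture "grouping writes for a stripe together" is a stable 
sort keyed only on the stripe number, leaving the order within a stripe 
as it arrived.  The sketch below assumes a fixed stripe size in blocks 
that the sorting code somehow knows; struct wreq and group_by_stripe() 
are hypothetical and are not part of any existing bufq strategy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical write request; not NetBSD's struct buf. */
struct wreq {
	uint64_t blkno;		/* starting block number */
	int id;			/* arrival order, for illustration */
};

static uint64_t
stripe_of(uint64_t blkno, uint64_t stripe_blocks)
{
	return blkno / stripe_blocks;
}

/*
 * Stable insertion sort on stripe number only.  Requests that fall
 * in the same stripe stay in arrival order, so they end up adjacent
 * in the queue without being sorted sequentially within the stripe.
 */
static void
group_by_stripe(struct wreq *q, size_t n, uint64_t stripe_blocks)
{
	for (size_t i = 1; i < n; i++) {
		struct wreq cur = q[i];
		uint64_t key = stripe_of(cur.blkno, stripe_blocks);
		size_t j = i;

		while (j > 0 &&
		    stripe_of(q[j - 1].blkno, stripe_blocks) > key) {
			q[j] = q[j - 1];
			j--;
		}
		q[j] = cur;
	}
}
```

Insertion sort is used here only because it is short and stable; a real 
queue would more likely insert each request into the right place as it 
is enqueued.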

I would also like to see a disk sorting algorithm that could coalesce 
adjacent writes or reads into single requests (perhaps enqueueing an 
uber-buf that pointed to a list of sub-bufs that were treated as s/g 
elements, or something).  As part of this, I'd really like to add a 
bus_dmamap_load_buf() that could handle various different data 
representations within "struct buf" (I have a project I'm currently 
planning that could really make use of attaching mbuf chains to bufs, 
rather than simple linear buffers).
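
A rough sketch of the coalescing step, assuming the queue is already 
sorted by block number: adjacent requests in the same direction are 
folded into one larger request that remembers how many original 
requests (scatter/gather segments) it covers.  struct xfer and 
coalesce() are hypothetical; this does not show the uber-buf/sub-buf 
linkage or any bus_dma interface, only the merge decision.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transfer descriptor; not NetBSD's struct buf. */
struct xfer {
	uint64_t blkno;		/* starting block number */
	uint64_t nblks;		/* length in blocks */
	int is_write;		/* nonzero for a write, zero for a read */
	unsigned nsegs;		/* original requests merged into this one */
};

/*
 * Merge runs of physically adjacent, same-direction requests in a
 * queue sorted by blkno.  Works in place; returns the new count.
 */
static size_t
coalesce(struct xfer *q, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	for (size_t i = 1; i < n; i++) {
		struct xfer *last = &q[out];

		if (q[i].is_write == last->is_write &&
		    q[i].blkno == last->blkno + last->nblks) {
			/* Contiguous with the previous request: extend it. */
			last->nblks += q[i].nblks;
			last->nsegs += q[i].nsegs;
		} else {
			q[++out] = q[i];
		}
	}
	return out + 1;
}
```

A real implementation would also need to cap the merged length and 
segment count at whatever the controller can accept in one command.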

         -- Jason R. Thorpe <>
