tech-perform: Re: SGI disk sorting algorithm, thoughts on disksort() lossage

Subject: Re: SGI disk sorting algorithm, thoughts on disksort() lossage
To: Thor Lancelot Simon <tls@rek.tjls.com>
From: Aidan Cully <aidan@kublai.com>
List: tech-perform
Date: 06/21/2002 23:27:19

On Fri, Jun 21, 2002 at 04:55:29PM -0400, Thor Lancelot Simon wrote:
> I can't find the whitepaper.  Here's what I could find, from
> http://www.sgi.com/developers/feature/2001/roadmap.html:
> 
> > Disk sorting in 6.5.8
> > 
> > Previously, all disk requests were sorted by block
> > number.  Unfortunately, if the filesystem write activity was more than
> > the disk could satisfy, the disk could get swamped with delayed write
> > requests. This would result in reads and synchronous writes being delayed
> > for extensive periods. In extreme cases, the system would appear to stall
> > or would experience NFS timeouts.
> > 
> > In 6.5.8, the queues are split. Doing this permits queuing delayed writes
> > into one queue, while synchronous writes and reads are entered into another
> > queue. In 6.5.8 the disk driver will alternate between queues. This ensures
> > that a large queue of delayed write requests will not adversely impact
> > interactive response. If both delayed writes and other requests are pending,
> > the driver will alternate between them, issuing several delayed writes,
> > then several of the other requests. Selecting several from each queue each
> > time, rather than just one from each queue each time, makes sequential
> > I/O faster and disk performance is maximized.
> 
> I can attest that this makes a truly surprising number of "system appears
> to hang while under heavy I/O load" problems go away -- alternating between
> the read/sync write and delayed-write queues lets new executables page in, 
> etc. and makes up for even truly pathological buffer-cache behaviour while 
> under heavy write load.  It's a pretty simple trick, too.

This looks nice to me, mostly because it looks easier to implement
than a CPU-like I/O scheduler...  Two things occur to me.  First, all
writes probably start out on the delayed-write queue and then have to
migrate to the synchronous queue in response to fsync(), msync(), and
others(?).  Second, it's probably still possible to cause problems by
forcing all writes to be synchronous.  A local user could reimplement
dd if=/dev/zero with an fsync() after a large enough number of writes
as a kind of DoS...  though there are probably simpler and more
effective ways of hosing a machine when you've got terminal access
(fork bombs still work?).
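
For what it's worth, here is a rough userland sketch of how I read the
split-queue idea.  It is not SGI's code and not anything in our tree.
Every name in it (struct sched, sched_next(), sched_promote(), BATCH,
QCAP, the per-request "file" field) is made up for illustration, the
batch size of 4 is only a guess at what "several" means, and the
insertion sort merely stands in for disksort().

```c
#include <assert.h>
#include <stddef.h>

#define QCAP  64   /* hypothetical fixed queue capacity */
#define BATCH 4    /* hypothetical value for "several" requests per turn */

enum { Q_URGENT = 0, Q_DELAYED = 1 };  /* reads + sync writes / delayed writes */

struct req   { long blkno; int file; };          /* "file" is an assumed tag */
struct queue { struct req r[QCAP]; size_t len; };
struct sched {
	struct queue q[2];
	int cur;    /* queue currently being served */
	int left;   /* requests left in the current batch */
};

static void sched_init(struct sched *s)
{
	s->q[Q_URGENT].len = s->q[Q_DELAYED].len = 0;
	s->cur = Q_URGENT;
	s->left = BATCH;
}

/* Keep each queue sorted by block number (stand-in for disksort()). */
static void q_insert(struct queue *q, struct req r)
{
	size_t i;

	assert(q->len < QCAP);
	for (i = q->len++; i > 0 && q->r[i - 1].blkno > r.blkno; i--)
		q->r[i] = q->r[i - 1];
	q->r[i] = r;
}

static void sched_add(struct sched *s, int which, struct req r)
{
	q_insert(&s->q[which], r);
}

/* Remove and return the lowest-numbered block on a non-empty queue. */
static struct req q_pop(struct queue *q)
{
	struct req r = q->r[0];
	size_t i;

	for (i = 1; i < q->len; i++)
		q->r[i - 1] = q->r[i];
	q->len--;
	return r;
}

/* Pick the next request to issue; returns 0 when both queues are empty. */
static int sched_next(struct sched *s, struct req *out)
{
	int other = !s->cur;

	if (s->q[Q_URGENT].len == 0 && s->q[Q_DELAYED].len == 0)
		return 0;
	if ((s->left == 0 || s->q[s->cur].len == 0) && s->q[other].len > 0) {
		s->cur = other;         /* batch done or queue drained: switch */
		s->left = BATCH;
	} else if (s->left == 0) {
		s->left = BATCH;        /* other queue idle: start a new batch */
	}
	*out = q_pop(&s->q[s->cur]);
	s->left--;
	return 1;
}

/* fsync()-style promotion: move one file's delayed writes to the urgent queue. */
static void sched_promote(struct sched *s, int file)
{
	struct queue *d = &s->q[Q_DELAYED];
	size_t i, keep = 0;

	for (i = 0; i < d->len; i++) {
		if (d->r[i].file == file)
			q_insert(&s->q[Q_URGENT], d->r[i]);
		else
			d->r[keep++] = d->r[i];
	}
	d->len = keep;
}
```

With six requests on each queue this issues four reads/sync writes,
four delayed writes, then the remaining two of each.  Note that it
does nothing about the all-synchronous case: if everything gets
promoted, the urgent queue is just the old single sorted queue again.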

--aidan