tech-kern: SGI disk sorting algorithm, thoughts on disksort() lossage

Subject: SGI disk sorting algorithm, thoughts on disksort() lossage
To: None <tech-kern@netbsd.org, tech-perform@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 06/21/2002 16:55:29
I can't find the whitepaper.  Here's what I could find, from
http://www.sgi.com/developers/feature/2001/roadmap.html:

> Disk sorting in 6.5.8
> 
> Previously, all disk requests were sorted by block
> number.  Unfortunately, if the filesystem write activity was more than
> the disk could satisfy, the disk could get swamped with delayed write
> requests. This would result in reads and synchronous writes being delayed
> for extensive periods. In extreme cases, the system would appear to stall
> or would experience NFS timeouts.
> 
> In 6.5.8, the queues are split. Doing this permits queuing delayed writes
> into one queue, while synchronous writes and reads are entered into another
> queue. In 6.5.8 the disk driver will alternate between queues. This ensures
> that a large queue of delayed write requests will not adversely impact
> interactive response. If both delayed writes and other requests are pending,
> the driver will alternate between them, issuing several delayed writes,
> then several of the other requests. Selecting several from each queue each
> time, rather than just one from each queue each time, makes sequential
> I/O faster and disk performance is maximized.

I can attest that this makes a truly surprising number of "system appears
to hang while under heavy I/O load" problems go away -- alternating between
the read/sync write and delayed-write queues lets new executables page in, 
etc. and makes up for even truly pathological buffer-cache behaviour while 
under heavy write load.  It's a pretty simple trick, too.
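
A minimal sketch of the split-queue idea described above, in C, might look
like the following. It is not SGI's code and not NetBSD's disksort(); all of
the names here (struct splitq, splitq_put, splitq_next, the QCAP limit and
the burst parameter) are hypothetical and exist only for illustration. It
assumes each queue is kept sorted by ascending block number, which the SGI
text does not actually state, and it ignores locking and elevator direction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QCAP 64                 /* hypothetical fixed queue capacity */

/* Two request classes: reads and synchronous writes vs. delayed writes. */
enum req_class { REQ_INTERACTIVE = 0, REQ_DELAYED_WRITE = 1 };

struct ioqueue {
    long   blk[QCAP];           /* block numbers, kept in ascending order */
    size_t count;
};

struct splitq {
    struct ioqueue q[2];        /* indexed by enum req_class */
    int cur;                    /* queue currently being served */
    int left;                   /* requests left in the current burst */
    int burst;                  /* requests taken per turn (assumed tunable) */
};

static void splitq_init(struct splitq *sq, int burst)
{
    memset(sq, 0, sizeof *sq);
    sq->cur = REQ_INTERACTIVE;
    sq->burst = burst;
    sq->left = burst;
}

/* Insert a request into its class queue, sorted by block number. */
static int splitq_put(struct splitq *sq, enum req_class cls, long blkno)
{
    struct ioqueue *q = &sq->q[cls];
    size_t i;

    if (q->count == QCAP)
        return -1;              /* queue full */
    for (i = q->count++; i > 0 && q->blk[i - 1] > blkno; i--)
        q->blk[i] = q->blk[i - 1];
    q->blk[i] = blkno;
    return 0;
}

/* Remove and return the lowest block number in a queue. */
static long ioqueue_get(struct ioqueue *q)
{
    long b;

    assert(q->count > 0);
    b = q->blk[0];
    q->count--;
    memmove(q->blk, q->blk + 1, q->count * sizeof q->blk[0]);
    return b;
}

/*
 * Pick the next request to issue.  Several requests are taken from one
 * queue, then several from the other.  Returns 0 and fills in *blkno and
 * *cls, or -1 if both queues are empty.
 */
static int splitq_next(struct splitq *sq, long *blkno, enum req_class *cls)
{
    int other = 1 - sq->cur;

    if (sq->q[sq->cur].count == 0 && sq->q[other].count == 0)
        return -1;

    if (sq->q[sq->cur].count == 0 ||
        (sq->left == 0 && sq->q[other].count > 0)) {
        /* Current queue drained, or burst used up and other has work. */
        sq->cur = other;
        sq->left = sq->burst;
    } else if (sq->left == 0) {
        /* Nothing waiting on the other side: start a new burst here. */
        sq->left = sq->burst;
    }

    *blkno = ioqueue_get(&sq->q[sq->cur]);
    *cls = (enum req_class)sq->cur;
    sq->left--;
    return 0;
}
```

The point of the burst counter is the one the SGI text makes: taking several
requests from a queue per turn keeps sequential runs together, while the
switch condition bounds how long a read can sit behind a backlog of delayed
writes.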

Another observation -- one from one of the Sprite guys, way back when, while
we were discussing some very anomalous LFS performance results I got with one
of my workstations -- was that using multiple partitions on a single disk
invalidated many of the basic assumptions underlying most studies of disk
performance.  He suggested in very strong terms that, like the Sprite team,
NeXT (the only workstation vendor who did this at the time) had made a very
good choice by using a single partition on all disks they shipped, and
ensuring that even swap I/O was handled through the filesystem (by swapping
to a file); the filesystem, after all, can do whatever kind of sophisticated
data placement it may be prone to do, but with "foreign" sources of I/O on
the same spindle, it is all too common to encounter pathological cases.

I have built almost all of my systems with a single partition since then and
I have seldom regretted it.  Occasionally I wish I had "subtree quotas" on
directories within the filesystem so that I could keep, say, /var/log
from overflowing its bounds, but with the size of modern disks, I seldom
want even that any more.

I note that a few highly-tuned, for-pay filesystems do, in fact, come with
"subtree quotas" or their moral equivalent, as well as very strong
recommendations that one may span one filesystem across multiple spindles,
but should not run multiple filesystems on one spindle.

-- 
 Thor Lancelot Simon	                                      tls@rek.tjls.com
   But as he knew no bad language, he had called him all the names of common
 objects that he could think of, and had screamed: "You lamp!  You towel!  You
 plate!" and so on.              --Sigmund Freud