Subject: Re: SGI disk sorting algorithm, thoughts on disksort() lossage
To: None <tls@rek.tjls.com>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-perform
Date: 06/21/2002 14:23:07
In message <20020621205528.GA11096@rek.tjls.com>, Thor Lancelot Simon writes:
>I can't find the whitepaper.  Here's what I could find, from
>http://www.sgi.com/developers/feature/2001/roadmap.html:
>
>> Disk sorting in 6.5.8
>> 
>> Previously, all disk requests were sorted by block
>> number. 


[...]
>> In 6.5.8, the queues are split. Doing this permits queuing delayed writes
>> into one queue, while synchronous writes and reads are entered into another
>> queue. In 6.5.8 the disk driver will alternate between queues. This ensures
>> that a large queue of delayed write requests will not adversely impact
>> interactive response. If both delayed writes and other requests are pending,
>> the driver will alternate between them, issuing several delayed writes,
>> then several of the other requests. Selecting several from each queue each
>> time, rather than just one from each queue each time, makes sequential
>> I/O faster and disk performance is maximized.

>I can attest that this makes a truly surprising number of "system appears
>to hang while under heavy I/O load" problems go away -- alternating between
>the read/sync write and delayed-write queues lets new executables page in, 
>etc. and makes up for even truly pathological buffer-cache behaviour while 
>under heavy write load.  It's a pretty simple trick, too.

Obvious caveat: if this trick is used with a UBC-style system that is
immature or poorly tuned (or even de-tuned by the user), then we can
still hit Manuel's scenario. The difference is that with the two-queue
trick, the xterm running top would get almost exactly twice the
page-in[ish] service rate it does now.

Which still sounds ... unusable, basically.


[... Sprite/NeXT story: don't put more than one partition on a spindle]
Which Sprite people?

Right around that same time, I was using mt.xinu more/bsd and 4.3-Reno
on 680[23]0s and VAXes. IIRC, the cost of eating more CPU cycles on
already-slow CPUs dominated any gains from single-partition disks.
My own workloads were happiest on machines with multiple filesystems
(a swap partition, plus one or more user filesystems) on each disk. We
did carefully lay out the swap in the middle, tho, and the SCSI disks
were fast for their day. Compared to RA-8x, anyway.

>I have built almost all of my systems with a single partition since then and
>I have seldom regretted it.  Occasionally I wish I had "subtree quotas" on
>directories within the filesystem so that I could constrain, say, /var/log
>from overflowing its bounds, but with the size of modern disks, I don't
>even often want that too much any more.
>
>I note that a few highly-tuned, for-pay filesystems do, in fact, come with
>"subtree quotas" or their moral equivalent, as well as very strong
>recommendations that one may run multiple spindles on one filesystem, but
>that one not run multiple filesystems on one spindle.

fwiw, I know IBM's logical volume-manager thingy for AIX made I/O
really *glacially* slow if one ignored that advice.