Subject: Re: Disk striping with NetBSD
To: Anthony DeLorenzo <gonzo@vex.net>
From: Miles Nordin <nordinm@Colorado.EDU>
List: port-sparc
Date: 07/21/2000 18:30:11
On Fri, 21 Jul 2000, Anthony DeLorenzo wrote:

> disk striping with NetBSD? 

ccd(4), raid(4)

The former is ultra-simple.  The latter may let you stripe the root
filesystem, which is rather fancy.
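
For the ccd route, the whole setup is about three commands.  A rough
sketch from memory (assuming you've reserved an 'e' partition on each
disk for the stripe---check ccd(4) and ccdconfig(8) before trusting my
syntax):

  # /etc/ccd.conf --- one stripe, 16-sector interleave, no special flags
  # ccd   ileave  flags   component devices
  ccd0    16      none    /dev/sd0e /dev/sd1e

  ccdconfig -C     # configure every ccd listed in /etc/ccd.conf
  # ...then disklabel ccd0 and newfs it as if it were one big disk.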

> having swap on two different drives / buses should give me some
> performance advantage, and the same thing with splitting / and /usr. 

This is true if you tend to write only one partition at a time.  But, if
you are hitting /, /usr, and swap all at the same time, you may find
you get better performance by using your two disks to separate /usr
from swap.
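
Incidentally, you don't need a stripe just to spread page-outs over two
spindles; listing both swap partitions in fstab is enough.  A sketch,
assuming 'b' partitions on sd0 and sd1 (see swapctl(8) for how swap
priorities decide which device gets used):

  /dev/sd0b   none   swap   sw   0 0
  /dev/sd1b   none   swap   sw   0 0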

Your plans sound good to me---I advise you go for it.  I will spend the
rest of this email trying to depress you with the futility of adopting any
strategy whatsoever, leaving you with that feeling of powerless confusion
which holds together our silent brotherhood of sysadmins.


This is a big question.  Bigger than I thought.  I've been writing about
it for half an hour already, and I'm sick of it.  I guess what I'm trying
to get at is this:  striping disks will help you get more data bandwidth
out of them, which is great if you are writing a single stream of
uncompressed video.  However, many people end up getting the biggest
pummeling from their disks' seek time.  If you care about seek time,
multiple disks are still good because they have separate heads which can
seek independently, increasing your ``seek bandwidth.'' But striped disks
tend to move their heads ``in sync,'' meaning to the same relative block
at the same time, more often than random chance.  This wastes ``seek
bandwidth,'' because it's more like having just one head assembly instead
of several.  Therefore, striping is not always good. 

I would also like to treat your suggestion of striping ``swap''
specifically.  I heard some early boasting from some Linux RAID guy who
said his machine really screamed because he striped his swap partition,
and I wonder if that's what's inspiring you.  I think it was in their kernel
'Changes' file or something.  If so, I encourage you to follow his or her
advice as an experiment, a hypothesis to be tested, a fun learning
experience.  I remember when I read that---I thought the guy sounded more
like an excited sophomore with a PeeCee on his desk than an experienced
sysadmin.  Granted, that's basically exactly what I am, but I still tend
to take filesystem layout advice from MCSEs, C:, D:, Q:, and sundry ZIP-disk
backoffice coldfusion ecommerce flakeheads, with a grain of salt.  In that
spirit, I question the wisdom of designing your filesystem layout around
optimal swap performance.

 o When swap performance is your machine's rate-limiting step, you are
   probably experiencing 'churn'.  Churn is a pathological state in which
   programs are using memory so vigorously that your swap space on disk
   is getting accessed about as much as your main memory (SDRAM, FDPRAM,
   WINRAM, whatever chips those crazy kids are using).  If you experience
   churn (the vmstat sketch after this list shows how to spot it), _you
   obviously need to buy more RAM_, not beef up your disk subsystem.  It
   is important to understand that churn is
   _pathological_---PeeCee/Wintel users often don't get this, since
   Windows is a pathological operating system that seems to 'churn'
   randomly from time to time under normal use.  I used to be a
   PeeCee/Wintel user, and I remember being surprised to learn that churn
   is a _breakdown_ of the way virtual memory is supposed to work, not
   just a sign that the machine is ``working really hard.''

 o The proper use of swap space is often very low-bandwidth.  Typically,
   your machine gets itself going, maps a bunch of giant libraries into
   memory, executes some run-once initialization code that's inside each
   application, and settles into a comfortable rut.  After you're logged
   in for a day, all that initialization cruft and those seldom-called
   library parts get paged out to the swap partition.  Maybe some friend
   telnets in and leaves his session idle for a day---his whole session
   gets paged out.  The Lisp era seems to be sadly over.  When swap is not
   the machine's rate-limiting step, it may be used very lightly indeed.

 o If you want better memory performance, buy more RAM.  Using swap less 
   is probably more effective than speeding up the disk it's on, and it
   may be cheaper, too.  But more RAM will seldom help the performance of
   your real filesystems.  Wasting spindles, controllers, and advanced
   layout strategies on swap is questionable, since there is a good
   alternative.
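
If you want to check which situation you're in, vmstat will tell you.
A sketch, assuming the stock vmstat(1) and swapctl(8)---column names
are from memory, so trust your man pages over me:

  vmstat -w 5     # one sample every five seconds
  # 'pi' and 'po' are pages paged in and out; occasional bursts are
  # normal, but sustained nonzero numbers during ordinary work are
  # the churn described above.

  swapctl -l      # how much of each swap device is actually in use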


Now, on to more constructive advice.  Filesystems aspire to accommodate the
seek characteristics of mechanical disks.  FFS tries to make sure data
blocks belonging to files that are close in the directory tree are close
together on the disk, which usually leads to less seeking than haphazard
data block arrangement.  But, to make this optimization, FFS must presume
it has the entire disk to itself.

Example:

Alternative 1:
mount point      offset      len
Disk 1:
/home1           0           1000000
Disk 2:
/home2           0           1000000

Alternative 2:
ccd1: stripe of <Disk 1> and <Disk 2>
mount point      offset      len
/home            0           2000000

If you always have exactly one user logged in, Alternative 2 is a clear
win.  However, if you usually have 200 users logged in (real users, not
users that just log in and leave a pine window open), Alternative 1 will
probably perform _better_ than 2, even though it's frustratingly primitive
and harder to sysadmin.  This is because Alternative 1 takes better
advantage of the two disks' separate head assemblies.

Likewise, if you split your disk into nine slices, one for each
subdirectory of /usr, you enforce your own arrangement of data blocks
and balk FFS's
optimization.  Granted, there is an FFS-ish spatial locality to the
nine-slices scheme just because mount points gather files close together
in the directory tree, but I think it's smarter to let the machine do the
work.  Example: 

Alternative 1:
mount point       offset     len
/                 0          10000
/usr              10000      300000

Alternative 2:
mount point       offset     len
/usr              0          300000
/                 300000     10000

FFS tries to cluster data blocks close together at the beginning of the
disk.  This clustering is a good idea because interspersing free space
among your files means that you often have to seek across it.  If a
filesystem is 10% full, you want all that 10% as close together as 
possible.

Alternative 2 is foolish, because /usr will probably be far from full.  As
the system runs, you will mix accesses to / and /usr, and every time the
head skips from one filesystem to another it will have to cross all the
empty space at the end of the /usr filesystem.  In Alternative 1, this is
not so---you need only seek across the much smaller free space in /. 
Similarly, splitting your disk into lots of tiny filesystems enforces the
spreading-out of free space, befuddling FFS's tendency to achieve an
optimal level of free-space fragmentation.
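
For concreteness, here is roughly what Alternative 1 looks like as a
disklabel (partition letters, sizes, and offsets invented for
illustration; 'c' is the whole-disk partition on sparc):

  #        size    offset   fstype   [fsize bsize cpg]
   a:     10000         0   4.2BSD    1024  8192    16   # /
   c:   2000000         0   unused       0     0         # whole disk
   d:    300000     10000   4.2BSD    1024  8192    16   # /usr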

This free-space thing is an obscure problem.  There are more obscure
problems---for example, modern disks can be twice as fast at the beginning
as they are at the end.  The platter spins at constant angular velocity
(CAV), the firmware puts more sectors on the outer cylinders than the
inner ones so the linear bit density stays roughly constant, and hard
disks are laid out the opposite of CDs---block 0 is on the outside.
This works well with FFS's tendency
to use low-numbered blocks first, so long as the bulk of the disk is one
big filesystem.  If you make nine partitions and accidentally put your
POP/IMAP server's /var/mail at the end of the disk, you may be shooting
yourself in the foot. 
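
You can see this effect on your own hardware with a crude dd test
against the raw device.  A sketch---sd0 and the block counts are made
up, and on sparc the 'c' partition covers the whole disk:

  # sequential read near the start of the disk (outer cylinders)
  dd if=/dev/rsd0c of=/dev/null bs=64k count=1024
  # the same amount of data near the end (inner cylinders); adjust
  # 'skip' so it lands near the last cylinders of your disk
  dd if=/dev/rsd0c of=/dev/null bs=64k skip=60000 count=1024

dd prints a bytes-per-second figure when it finishes; compare the two.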

Things are going to get weirder and weirder as we start using filesystems
that are fancier than FFS.  You can expect the ``I have the disk to
myself'' assumption to remain, since the strategies it suggests usually
help some even when the assumption is false.  Everything else is up for
grabs.

There's just no hope of anticipating such nonsense.  It's best to leave
these problems to filesystem designers who care about solving them.  For
now, it seems like the best way to do that is ``don't get too creative.''
One very uncreative strategy is, ``try to put high-traffic partitions on
separate disks.'' If you've done this, and you still have separate
spindles left over, then you might stripe the busiest partition. 
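
If you do get that far, the raid(4) flavor of a plain stripe looks
roughly like this---config from memory, so verify against raidctl(8),
and sd0e/sd1e and the numbers are only placeholders:

  # /etc/raid0.conf --- two-component RAID 0 (striping, no redundancy)
  START array
  # numRow numCol numSpare
  1 2 0

  START disks
  /dev/sd0e
  /dev/sd1e

  START layout
  # sectors/stripe-unit  SUs/parity-unit  SUs/recon-unit  RAID-level
  64 1 1 0

  START queue
  fifo 100

  raidctl -C /etc/raid0.conf raid0   # configure the array
  raidctl -I 20000721 raid0          # stamp the component labels
  # ...then disklabel raid0 and newfs, as with any other disk.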

> If I [install X], I would imagine that I might have to make some
> changes. 

IMHO, putting X on a separate partition is silly.  The practice came from
an era when development proceeded very slowly.  /usr contained binaries
from your vendor, and you might update it once every three years or
something.  Between updates, it was a fixed size, never written, might as
well be mounted read-only.  Its contents were ``sacred'' since they came
from your vendor and were above change.  Unlike this new X11 thing,
``which is just some new-fangled nonsense from MIT---we'll probably end up
going with SunView in the end, and _maybe_ keep some of the X cruft in
/usr/local.  Since it's experimental, we better segregate it.''  uh-huh.

Things change quickly these days.  I always just make a gigantic /usr, a
/, and an /altroot.

The idea of splitting off /var and /home to avoid corrupting /usr appeals
to me, although I seldom do it myself because I get furious when I'm out
of space in one filesystem while another is nearly empty.


Then Coda throws a wrench into everything by separating the idea of
``volumes'' from that of ``filesystems.''  Have fun.

-- 
Miles Nordin / +1 720 841-8308