Subject: Re: VPS mailing list, BSD interest?
To: Poul-Henning Kamp <phk@critter.tfs.com>
From: Jason Thorpe <thorpej@nas.nasa.gov>
List: tech-kern
Date: 10/01/1996 09:11:18
[ Keeping in mind, I haven't been thinking about mass storage for
  some time, and was hoping to keep my brain out of that mode, but
  whatever :-) ]

On Tue, 01 Oct 1996 09:18:47 +0200 
 Poul-Henning Kamp <phk@critter.tfs.com> wrote:

 > The problem I'm referring to is that this should not be done in a
 > pseudo-driver, but as a general framework for bdevs.
 > 
 > For instance, why can't I have my root-partition striped ?

I think a better question is "why would I _want_ my root partition striped?"
:-)

(The real answer to your question is "Because then you've added unnecessary
clutter to the ccd configuration code to deal with both statically-
and dynamically-configured ccds".  In my mind, saying that your
MUST WORK AT ALL COSTS filesystem isn't allowed to be striped is an
acceptable trade-off :-)

 > There is no significant difference between the FDISK, bsd-disklabel,
 > mirror, stripeing and raid 5 operations.  They all translate a
 > (dev+blkno+len) tupple to one or more similar tupples.

...True, but the way they're translated makes a world of difference.
In the case of mirroring, you're translating a <dev+blk+len> into
multiple <dev+blk+len> for writes, and for reads, you want to find
the least-busy component, attempt the read, and then retry with
another component if that read fails (indeed, you want to continue
trying until you're out of living components).
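
As a rough sketch of those mirror semantics (every name here is
hypothetical, and "pending request count" is only an assumed measure
of busy-ness; this is not code from any actual mirror driver):

```c
#include <assert.h>

#define MIRROR_MAXCOMP 8	/* hypothetical limit on components */

/* Hypothetical per-component state. */
struct mirror_comp {
	int alive;	/* 0 once the component is known dead */
	int pending;	/* outstanding requests: assumed "busy" measure */
	/* hypothetical I/O hook: returns 0 on success, nonzero on error */
	int (*io)(struct mirror_comp *, long blkno, long len, int is_write);
};

/* Stub I/O hooks, for illustration only. */
int mirror_io_ok(struct mirror_comp *c, long b, long l, int w)
{ (void)c; (void)b; (void)l; (void)w; return 0; }
int mirror_io_fail(struct mirror_comp *c, long b, long l, int w)
{ (void)c; (void)b; (void)l; (void)w; return 1; }

/* Pick the living, not-yet-tried component with the fewest pending requests. */
int
mirror_pick(struct mirror_comp *c, int n, const int *tried)
{
	int i, best = -1;

	for (i = 0; i < n; i++) {
		if (!c[i].alive || tried[i])
			continue;
		if (best < 0 || c[i].pending < c[best].pending)
			best = i;
	}
	return best;
}

/* Read: try the least-busy component, then fall back to the others.
 * Returns the index of the component that served the read, or -1. */
int
mirror_read(struct mirror_comp *c, int n, long blkno, long len)
{
	int tried[MIRROR_MAXCOMP] = { 0 };
	int i;

	assert(n > 0 && n <= MIRROR_MAXCOMP);
	while ((i = mirror_pick(c, n, tried)) >= 0) {
		tried[i] = 1;
		if (c[i].io(&c[i], blkno, len, 0) == 0)
			return i;
	}
	return -1;	/* out of living components */
}

/* Write: one request fans out to every living component.
 * Returns the number of components that accepted the write. */
int
mirror_write(struct mirror_comp *c, int n, long blkno, long len)
{
	int i, ok = 0;

	for (i = 0; i < n; i++)
		if (c[i].alive && c[i].io(&c[i], blkno, len, 1) == 0)
			ok++;
	return ok;
}
```

Note that the read path carries state (who has been tried, who is
alive) that a plain block-number translation never needs.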

The vast majority of the code in the ccd is dealing with configuration
(looking up the components, constructing the interleave table, etc.).
The actual translation code is small ...
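
To give a feel for how small, here is a minimal sketch of a striping
translation.  It assumes all components are the same size, so it needs
no interleave table; the names are made up and this is not the ccd's
actual code:

```c
#include <assert.h>

/* Hypothetical result of translating one logical block. */
struct stripe_addr {
	int  comp;	/* index of the component holding the block */
	long blkno;	/* block number within that component */
};

/* Map a logical block onto one of `ncomp' equal-sized components,
 * placing `ileave' consecutive blocks on each component in turn. */
struct stripe_addr
stripe_xlate(long blkno, int ncomp, long ileave)
{
	struct stripe_addr a;
	long chunk, off;

	assert(blkno >= 0 && ncomp > 0 && ileave > 0);
	chunk = blkno / ileave;		/* which interleave-sized chunk */
	off = blkno % ileave;		/* offset inside that chunk */
	a.comp = (int)(chunk % ncomp);
	a.blkno = (chunk / ncomp) * ileave + off;
	return a;
}
```

With 3 components and an interleave of 16, logical block 50 falls in
chunk 3, which wraps around to component 0 at block 18.  A request
that crosses a chunk boundary would have to be split into one request
per chunk, which this sketch leaves out.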

The same is true of the mirror driver I started (but never finished).
It was mostly configuration, though the translation code was a bit
more complicated due to "mirroring on writes, read from least busy
with error recovery" semantics.  The mirror driver also, by design,
doesn't support disklabels (doesn't make any sense, really; perhaps
I just want to mirror a single partition).  In short, the mirror's
semantics of "tupple translation" are vastly different from the ccd's,
and on planet 9 compared to regular partition translation.

Smashing the configuration of those two drivers (I'd actually rather
call them `layers') together would be silly, because they have
different configuration needs.

In my little world, the right way to get mirroring + striping
is to either:

	- make several 2- or 3- (or N-)way `mirror disks' and use
	  those mirror disks as components for a ccd.

	- make 2, 3 (or N) identical ccds, and use those
	  as the components of a `mirror disk'.

...depending on the behavior you want (probably the former).

It's not clear there's any real architectural benefit from creating
a generic framework for doing this sort of translation.  In fact, I
see at least one very negative outcome: you slow down and bloat up
the simple case of partition translation (which, as it stands now,
is very fast, and very simple).
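
For comparison, the simple case amounts to roughly a bounds check and
an add.  A sketch with hypothetical names follows; real kernel code
would more likely truncate a request at the end of the partition than
reject it outright, so treat the error handling as an assumption:

```c
#include <assert.h>

/* Hypothetical partition descriptor. */
struct part {
	long offset;	/* first block of the partition on the disk */
	long size;	/* length of the partition in blocks */
};

/* Translate a partition-relative request to an absolute block number.
 * Returns -1 if the request does not fit inside the partition. */
long
part_xlate(const struct part *p, long blkno, long len)
{
	assert(p != 0);
	if (blkno < 0 || len < 0 || blkno > p->size || len > p->size - blkno)
		return -1;
	return p->offset + blkno;
}
```

No lookup tables, no per-component state, no retries: that is the
path a generic framework would have to avoid slowing down.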

Having worked with IRIX's logical volume stuff, the principle
of KISS was high on my list when doing the ccd work :-)

Jason R. Thorpe                                       thorpej@nas.nasa.gov
NASA Ames Research Center                               Home: 408.866.1912
NAS: M/S 258-6                                          Work: 415.604.0935
Moffett Field, CA 94035                                Pager: 415.428.6939