Subject: Re: HSM implementation proposal
To: Ty Sarna <tsarna@endicor.com>
From: Jason Thorpe <thorpej@nas.nasa.gov>
List: tech-kern
Date: 12/04/1997 23:28:45
[ Note, I've only skimmed this, as I'm doing packing, etc. since I have
  to fly to DC for the IETF on Saturday morning, but I wanted to make a
  couple of comments... --thorpej ]

On Thu, 4 Dec 1997 23:31:16 -0600 (CST) 
 Ty Sarna <tsarna@endicor.com> wrote:

 > I had some time that wasn't useful for much else tonight, so I sat down
 > and fleshed out some of the details of my HSM scheme.  (And I should
 > note that just because I have "a HSM scheme", I don't have plans to
 > implement it -- this was just a fun intellectual exercise.  Maybe
 > someone else will take it and run, though.  Or, I could be talked into
 > doing it for money.  I'm already working on too many things that don't
 > pay :->)
 > 
 > Here goes:
 > 
 > Add a flag to the process structure.  Let's call it the "unveiled" flag. 
 > It's inherited by child processes, and defaults to off.  (Optionally,
 > there may be a method for a process to turn off this flag if it inherits
 > it on and wants to give it up, but I don't believe it's necessary for
 > the scheme.) Since it defaults to off, is inherited, and there is no
 > way for normal processes to set it, all normal processes are veiled.

Be warned, this is brainstorm spew...

So, from my quick skimming of this, I think I pretty much agree with
your scheme.  In fact, I've been thinking of things along very similar
lines.

The issue I disagree with is the use of the "unveiled" flag.  I think
that the process structure needs to stay out of this.  It doesn't really
have anything to do with the file system.

What I think would be more plausible is a passthrough vfs layer: hsmfs.
This would be sort of a nullfs-like thing.  For operations on resident
files, the ops just pass through to the lower layer.  For non-resident
files, the appropriate restoration occurs.
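
To make the passthrough idea a bit more concrete, here is a rough
Python model of it.  This is only a sketch: LowerFS, HsmLayer, and the
"archive" dict are names I made up for illustration, and a real layer
would of course be kernel C sitting on the vnode ops, not this.

```python
class LowerFS:
    """Stand-in for the lower layer (e.g. ufs): maps path -> bytes."""

    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class HsmLayer:
    """Hypothetical hsmfs: passes ops through, restoring files first if needed."""

    def __init__(self, lower, archive, non_resident):
        self.lower = lower
        self.archive = archive              # offline copies (assumed: path -> bytes)
        self.non_resident = set(non_resident)

    def _ensure_resident(self, path):
        # Non-resident file: restore its data to the lower layer first.
        if path in self.non_resident:
            self.lower.write(path, self.archive[path])
            self.non_resident.discard(path)

    def read(self, path):
        self._ensure_resident(path)
        return self.lower.read(path)        # plain passthrough from here on

    def write(self, path, data):
        self._ensure_resident(path)
        self.lower.write(path, data)


lower = LowerFS()
lower.write("/a", b"resident data")
lower.write("/b", b"")                      # truncated, non-resident stub
hsm = HsmLayer(lower, archive={"/b": b"archived data"}, non_resident={"/b"})
```

Reading "/a" through hsm just passes through; reading "/b" triggers the
restoration and then passes through like any other resident file.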

The key bit, here, is a database, that exists as a regular file on the
lower layer, that is always resident.  This database contains the additional
metadata for the files in the file system, indexed by inode number, or
whatever is appropriate.  (It might make sense to make this a ufs-specific
thin layer, because of the need to know file system specifics.)

Anyhow, this database is consulted each time an op is performed in the
hsmfs.  The hsmd (which does the file archiving, restoration, and the
truncation necessary to make a file non-resident) simply operates on the
file by accessing the lower layer directly (by using the real mount point,
which is protected from everyone else with directory permissions).  This
daemon would first lock the file by making a record in the database, which
would prevent other things from operating on the file via the hsmfs layer.
After it's done, it unlocks the file's database record.
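
Roughly, the locking dance could look like the following Python sketch.
The record layout (a "resident" and a "locked" field keyed by inode
number) and all of the function names are assumptions of mine, just to
show the order of operations.

```python
class HsmDatabase:
    """Hypothetical always-resident metadata database, indexed by inode number."""

    def __init__(self):
        self.records = {}

    def record(self, ino):
        # Files with no record yet are assumed resident and unlocked.
        return self.records.setdefault(ino, {"resident": True, "locked": False})

    def lock(self, ino):
        rec = self.record(ino)
        if rec["locked"]:
            raise RuntimeError("record already locked")
        rec["locked"] = True

    def unlock(self, ino):
        self.record(ino)["locked"] = False


def hsmfs_op_allowed(db, ino):
    """The hsmfs layer consults the database before every op on a file."""
    return not db.record(ino)["locked"]


def hsmd_make_non_resident(db, ino, lower_files, archive):
    """hsmd works on the lower layer directly, holding the record lock."""
    db.lock(ino)
    try:
        archive[ino] = lower_files[ino]     # archive the data
        lower_files[ino] = b""              # truncate the on-disk file
        db.record(ino)["resident"] = False
    finally:
        db.unlock(ino)                      # always release the record
```

While the record is locked, hsmfs_op_allowed() returns False, so an op
arriving via the hsmfs layer would have to wait or fail.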

Ok, so this probably sounds kind of scattered, but I hope you get the
idea... I wish I had a couple of hours, an audience, and a whiteboard
on which to flesh this out a bit... but alas, I have to get ready for a
trip.

Stay tuned, and Matt/Alan, make sure you get in touch w/ me about this
once I get back from DC.  :-)

 > 
 > Reserve one more of the 4.4BSD file flags.  Let's call it the "fake"
 > flag.  It may be useful to reserve 1 or 2 other flags for the HSM
 > scheme, ("don't swap this file out of local storage", perhaps) but the
 > kernel need not care about them. 
 > 
 > hsmd opens a channel to the kernel. Let's say it's a character special
 > device, but it could be some other mechanism. The act of opening this
 > channel sets the unveiled flag for hsmd (which in turn is inherited by
 > processes acting on its behalf).
 > 
 > When any process performs actions on a file without the fake flag set (a
 > real file), things happen as now. Also, processes with the unveiled flag
 > set always do so, whether the fake flag is set or not. (I.e., they have an
 > unveiled view of the filesystem, and see things as they really are). If
 > a normal process tries to open a file or directory with the fake flag
 > set, then things happen differently. If /dev/hsm is not open, then it
 > fails with an error. If there is something listening on /dev/hsm, it
 > gets sent a message telling it that there is an open attempt on the fake
 > file, and waits for a reply before continuing with the open. (or, if the
 > reply indicates an error, the open fails).
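
The open-time decision described above boils down to something like
this Python sketch.  The function and parameter names are hypothetical,
and ENXIO is only my guess at an error code, since the proposal just
says "an error".

```python
import errno


def open_decision(file_is_fake, proc_unveiled, hsm_channel_open, ask_hsmd):
    """Return 0 if the open may proceed, else an errno value.

    ask_hsmd is a callable standing in for "send a message to hsmd and
    wait for the reply"; it returns 0 on success or an errno value.
    """
    # Real files, and any file seen by an unveiled process, open as now.
    if not file_is_fake or proc_unveiled:
        return 0
    # Nobody listening on /dev/hsm: fail (error code is an assumption).
    if not hsm_channel_open:
        return errno.ENXIO
    # Otherwise block until hsmd has made the file real, or reports failure.
    return ask_hsmd()
```

For example, open_decision(True, False, True, lambda: 0) models a
veiled process opening a fake file that hsmd restores successfully.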
 > 
 > hsmd acts on the request to make the file real as follows:
 > 
 > If the file was a file, then the real, unveiled, on-disk file is
 > actually just an empty version of the file.  I.e., the inode is there, but
 > the file is zero-length.  hsmd retrieves the data for the file from
 > wherever, writes it to the file, clears the fake flag, and replies
 > to the request.  The file is now real, and can be accessed as normal. 
 > 
 > If the file was a directory, then the real, unveiled, on-disk file is a
 > directory, but with no entries (except . and .., of course). hsmd
 > consults the offline stuff, figures out what files are in that
 > directory, creat()s them, sets the fake flag on them, and then
 > clears the fake flag on the directory. Then it replies to the request --
 > now the directory is real and you can readdir() it or whatever, or open
 > a file in it (in which case the fake file is faulted in as described
 > above), or open a subdirectory in it (repeat this process).
 > 
 > Voila'! A complete system for automatically swapping files in from the
 > offline storage to the local filesystem.  OK, there are a few more
 > details.  In order to stat() fake files without bringing in the data
 > (you'd hate to have to swap in all the files in a directory to do an ls
 > -l there), yet get the real length (remember the on-disk file is empty),
 > any stat-type operation on a fake file proceeds as normal, but before
 > the struct stat is returned to the calling process, it's handed to hsmd
 > to fill in the sizes.
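
As a sketch of that stat path, in Python for brevity: the dict stands
in for struct stat, and hsmd_real_size is a made-up callable standing
in for the round trip to hsmd.  Skipping the fix-up for unveiled
callers is my assumption, based on them seeing things as they really
are.

```python
def stat_fake_aware(on_disk_stat, is_fake, caller_unveiled, hsmd_real_size):
    """Return the stat result a caller should see for one file."""
    result = dict(on_disk_stat)             # the normal stat, as found on disk
    if is_fake and not caller_unveiled:
        # On-disk length is 0 for a fake file; ask hsmd for the real one.
        result["st_size"] = hsmd_real_size(result["st_ino"])
    return result


# Hypothetical table hsmd keeps: inode number -> real (offline) length.
real_sizes = {42: 1048576}
on_disk = {"st_ino": 42, "st_size": 0}
veiled_view = stat_fake_aware(on_disk, True, False, real_sizes.__getitem__)
unveiled_view = stat_fake_aware(on_disk, True, True, real_sizes.__getitem__)
```

Here the veiled view reports the offline length while the unveiled view
keeps the zero length that is really on disk.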
 > 
 > BTW, requests and replies from the kernel to hsmd should include a
 > request identifier, so that there can be multiple requests to hsmd
 > outstanding. And a well-written hsmd will probably want to be
 > multi-threaded, or at least fork() off to handle slow requests.
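
The request identifier bit might be modeled like this.  A Python sketch
only; RequestTable and its fields are invented names, and the message
contents are an assumption.

```python
import itertools


class RequestTable:
    """Hypothetical kernel-side table of requests outstanding to hsmd."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.pending = {}                   # request id -> (kind, inode number)

    def send(self, kind, ino):
        # Tag each request so the reply can be matched to it later.
        rid = next(self._ids)
        self.pending[rid] = (kind, ino)
        return rid

    def reply(self, rid, error=0):
        # Replies may arrive in any order; the id says which waiter to wake.
        kind, ino = self.pending.pop(rid)
        return kind, ino, error


table = RequestTable()
slow = table.send("open", 100)              # e.g. a file coming back from tape
fast = table.send("stat", 200)
```

With this, the reply for "fast" can be handled before the one for
"slow" without the two getting confused.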
 > 
 > So, that's the swap-in part. Not too complicated.
 > 
 > The swap-out is even simpler, and happens almost entirely in userland. 
 > Whenever hsmd feels it's necessary, it spawns a cleaner process or
 > thread. The cleaner decides what needs to go, and moves it back to
 > offline.
 > 
 > To swap out a file, it sets the fake flag. If the file has been
 > modified, it writes it back. Then it truncates the file.
 > 
 > To swap out a directory, all files in it must be fake, or have a link
 > count > 1. Set the fake flag on the directory. If the contents of the
 > directory have been modified (files added or deleted), write the
 > directory listing back. Then unlink all the files in the directory.
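
The directory eligibility rule above is simple enough to state as a
one-liner.  A Python sketch, with Entry being a made-up stand-in for
whatever the cleaner reads out of the inode:

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical view of one directory entry's inode state."""
    fake: bool      # fake flag set on the file
    nlink: int      # link count


def can_swap_out_directory(entries):
    """A directory may go offline if every entry is fake or has nlink > 1."""
    return all(e.fake or e.nlink > 1 for e in entries)
```

So a directory holding one real, singly-linked file is not eligible
until that file has itself been swapped out.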
 > 
 > That's that.  The one remaining nit is that hsmd is going to want to
 > know about files (including directories) that get deleted, so when the
 > link count of a file goes to 0, hsmd gets a message from the kernel with
 > a copy of the inode data, so it can update its view of the world (most
 > things will key off the inode number, to get the "real" offline file
 > identifier).
 > 
 > That's the whole scheme. Very little kernel support needed -- all the
 > real work happens in userland, which makes it easier to debug and adapt
 > to different purposes or types of offline storage. The hardest part is
 > the management of the offline storage, which could vary quite a bit
 > between different environments -- NASA backing the disk with tape, while
 > in an environment with large numbers of more normal-usage users, you
 > might back the disk with CD jukeboxes, and clean less often (burning
 > changed data on CDs). That's one nice thing about doing all the hard
 > stuff in userland: it makes handling such diverse schemes a lot easier.
 > 
 > This scheme:
 > 
 > + Does not require an inode on the local disk for every file in the
 >   entire hsm volume -- only for the ones currently on local disk and
 >   the fake files used to trigger swap-ins.
 > 
 > + It isn't necessary for the entire directory tree of the HSM volume to
 >   be represented on disk. Whole subtrees can be swapped out. Or, you
 >   can keep them all local -- the policy for that is implemented in
 >   userland.
 > 
 > + If you want, hsmd can use the fake files to store metadata about the
 >   real files -- instead of just having them 0-length, store your data.
 >   Unveiled processes will be able to see and manipulate it to implement
 >   your scheme.
 > 
 > + No modifications needed to fsck or dump.  In fact, this should work
 >   for any filesystem type that supports file flags. LFS, for instance.
 > 
 > - Entire files must be moved on or off local storage. I.e., it's a
 >   swapping rather than paging scheme. This is probably fine for most
 >   uses. If you need paging, you probably need to implement a whole new
 >   VFS type anyway and not layer it on top of UFS.
 > 
 > Comments?

Jason R. Thorpe                                       thorpej@nas.nasa.gov
NASA Ames Research Center                            Home: +1 408 866 1912
NAS: M/S 258-6                                       Work: +1 650 604 0935
Moffett Field, CA 94035                             Pager: +1 415 428 6939