Subject: HSM implementation proposal
To: None <tech-kern@NetBSD.ORG>
From: Ty Sarna <tsarna@endicor.com>
List: tech-kern
Date: 12/04/1997 23:31:16

I had some time that wasn't useful for much else tonight, so I sat down
and fleshed out some of the details of my HSM scheme.  (And I should
note that just because I have "a HSM scheme", I don't have plans to
implement it -- this was just a fun intellectual exercise.  Maybe
someone else will take it and run, though.  Or, I could be talked into
doing it for money.  I'm already working on too many things that don't
pay :->)

Here goes:

Add a flag to the process structure.  Let's call it the "unveiled" flag. 
It's inherited by child processes, and defaults to off.  (Optionally,
there may be a method for a process to turn off this flag if it inherits
it on and wants to give it up, but I don't believe it's necessary for
the scheme.) Since it defaults to off, is inherited, and there is no way
for normal processes to set it, all normal processes are veiled.

Reserve one more of the 4.4BSD file flags.  Let's call it the "fake"
flag.  It may be useful to reserve 1 or 2 other flags for the HSM
scheme, ("don't swap this file out of local storage", perhaps) but the
kernel need not care about them. 

hsmd opens a channel to the kernel. Let's say it's a character special
device, but it could be some other mechanism. The act of opening this
channel sets the unveiled flag for hsmd (which in turn is inherited by
processes acting on its behalf).
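
To make the flag's life cycle concrete, here is a minimal sketch in
plain C rather than real kernel code.  The struct, the P_UNVEILED name
and value, and all the toy_* helpers are hypothetical stand-ins for
illustration, not existing NetBSD interfaces.

```c
#include <stdbool.h>

/* Hypothetical per-process flag bit; not an existing NetBSD p_flag value. */
#define P_UNVEILED 0x1u

/* Toy stand-in for the kernel's process structure. */
struct toy_proc {
    unsigned int flags;
};

/* A brand-new process starts veiled: the flag defaults to off. */
static void toy_proc_init(struct toy_proc *p)
{
    p->flags = 0;
}

/* fork(): the child inherits the parent's unveiled state. */
static void toy_proc_fork(const struct toy_proc *parent, struct toy_proc *child)
{
    child->flags = parent->flags & P_UNVEILED;
}

/* Opening the HSM channel is the only path that sets the flag. */
static void toy_hsm_channel_open(struct toy_proc *p)
{
    p->flags |= P_UNVEILED;
}

static bool toy_proc_is_unveiled(const struct toy_proc *p)
{
    return (p->flags & P_UNVEILED) != 0;
}
```

Because nothing but the channel open sets the bit, a child forked from
hsmd is unveiled while every process descended from an ordinary parent
stays veiled.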

When any process performs actions on a file without the fake flag set (a
real file), things happen as now. Processes with the unveiled flag set
also always get today's behavior, whether the fake flag is set or not
(i.e., they have an unveiled view of the filesystem, and see things as
they really are). If a normal process tries to open a file or directory
with the fake flag set, then things happen differently. If /dev/hsm is
not open, the open fails with an error. If there is something listening
on /dev/hsm, the kernel sends it a message saying that there is an open
attempt on the fake file, and waits for a reply before continuing with
the open (or, if the reply indicates an error, the open fails).
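
The open-time decision above boils down to three inputs and three
outcomes.  A rough sketch as a pure C function follows; the enum and
function names are made up for illustration, and a real implementation
would live somewhere in the kernel's open path.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the open-time check. */
enum hsm_open_action {
    HSM_OPEN_PROCEED,   /* handle the open exactly as today */
    HSM_OPEN_FAIL,      /* fake file, but nothing has /dev/hsm open */
    HSM_OPEN_ASK_HSMD   /* message hsmd and wait for its reply */
};

static enum hsm_open_action
hsm_open_decide(bool proc_unveiled, bool file_fake, bool hsmd_listening)
{
    /* Real files, and any file seen by an unveiled process: no change. */
    if (proc_unveiled || !file_fake)
        return HSM_OPEN_PROCEED;

    /* A veiled process hit a fake file with no daemon to fault it in. */
    if (!hsmd_listening)
        return HSM_OPEN_FAIL;

    /* A veiled process hit a fake file: hsmd must make it real first. */
    return HSM_OPEN_ASK_HSMD;
}
```

Note that the unveiled check comes first, so hsmd itself can always
open a fake file without generating a request to itself.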

hsmd acts on the request to make the file real as follows:

If the file was a file, then the real, unveiled, on-disk file is
actually just an empty version of the file.  I.e., the inode is there, but
the file is zero-length.  hsmd retrieves the data for the file from
wherever, writes it to the file, clears the fake flag, and replies
to the request.  The file is now real, and can be accessed as normal. 

If the file was a directory, then the real, unveiled, on-disk file is a
directory, but with no entries (except . and .., of course). hsmd
consults the offline stuff, figures out what files are in that
directory, creat()s them, sets the fake flag on them, then
clears the fake flag on the directory. Then it replies to the request --
now the directory is real and you can readdir() it or whatever, or open
a file in it (in which case the fake file is faulted in as described
above), or open a subdirectory in it (repeat this process).

Voila! A complete system for automatically swapping files in from the
offline storage to the local filesystem.  OK, there are a few more
details.  In order to stat() fake files without bringing in the data
(you'd hate to have to swap in all the files in a directory to do an ls
-l there), yet get the real length (remember the on-disk file is empty),
any stat-type operation on a fake file proceeds as normal, but before
the struct stat is returned to the calling process, it's handed to hsmd
to fill in the sizes.
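
The stat fix-up might look roughly like the sketch below.  The reply
structure and function name are hypothetical, and treating "the sizes"
as st_size plus st_blocks is my reading, not something the scheme pins
down.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical reply from hsmd with the true sizes of an offline file. */
struct hsm_stat_reply {
    off_t    real_size;    /* length of the file's offline data */
    blkcnt_t real_blocks;  /* block count to report for it */
};

/*
 * Patch a struct stat that was filled in from the empty on-disk inode.
 * Only veiled callers looking at fake files get the substituted sizes;
 * unveiled callers keep seeing the on-disk truth.
 */
static void
hsm_stat_fixup(struct stat *st, bool file_fake, bool caller_unveiled,
    const struct hsm_stat_reply *reply)
{
    if (!file_fake || caller_unveiled)
        return;
    st->st_size = reply->real_size;
    st->st_blocks = reply->real_blocks;
}
```

Everything else in the struct stat (owner, mode, times, link count)
still comes from the local inode, which is why the fake file has to
exist on disk at all.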

BTW, requests and replies from the kernel to hsmd should include a
request identifier, so that there can be multiple requests to hsmd
outstanding. And a well-written hsmd will probably want to be
multi-threaded, or at least fork() off to handle slow requests.
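
To show how a request identifier lets replies come back in any order,
here is a small sketch of a pending-request table.  The message
layouts, the table size of 8, and all names are assumptions made for
the example; the scheme only requires that each reply carries the id
of the request it answers.

```c
#include <stddef.h>
#include <stdint.h>

enum hsm_msg_type { HSM_MSG_OPEN, HSM_MSG_STAT };

struct hsm_request {
    uint32_t          id;     /* echoed back in the matching reply */
    enum hsm_msg_type type;
    uint64_t          inode;  /* which file the request is about */
};

struct hsm_reply {
    uint32_t id;     /* id of the request being answered */
    int      error;  /* 0 on success, otherwise an errno value */
};

#define HSM_MAX_PENDING 8

struct hsm_pending {
    struct hsm_request slots[HSM_MAX_PENDING];
    int                in_use[HSM_MAX_PENDING];
    uint32_t           next_id;
};

/* Record an outstanding request; returns its id, or 0 if the table is full. */
static uint32_t
hsm_pending_add(struct hsm_pending *t, enum hsm_msg_type type, uint64_t inode)
{
    for (size_t i = 0; i < HSM_MAX_PENDING; i++) {
        if (t->in_use[i])
            continue;
        if (++t->next_id == 0)      /* 0 is reserved to mean "no id" */
            t->next_id = 1;
        t->slots[i].id = t->next_id;
        t->slots[i].type = type;
        t->slots[i].inode = inode;
        t->in_use[i] = 1;
        return t->next_id;
    }
    return 0;
}

/*
 * Match a reply to its request.  On a match, copy the request to *out,
 * free the slot and return 1; return 0 for an unknown or stale id.
 */
static int
hsm_pending_complete(struct hsm_pending *t, const struct hsm_reply *r,
    struct hsm_request *out)
{
    for (size_t i = 0; i < HSM_MAX_PENDING; i++) {
        if (t->in_use[i] && t->slots[i].id == r->id) {
            *out = t->slots[i];
            t->in_use[i] = 0;
            return 1;
        }
    }
    return 0;
}
```

With this, a quick stat reply can overtake a slow open that is still
waiting on tape, and a duplicate or bogus reply is simply dropped.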

So, that's the swap-in part. Not too complicated.

The swap-out is even simpler, and happens almost entirely in userland. 
Whenever hsmd feels it's necessary, it spawns a cleaner process or
thread. The cleaner decides what needs to go, and moves it back to
offline.

To swap out a file, it sets the fake flag. If the file has been
modified, it writes it back. Then it truncates the file.

To swap out a directory, all files in it must be fake, or have a link
count > 1. Set the fake flag on the directory. If the contents of the
directory have been modified (files added or deleted), write the
directory listing back. Then unlink all the files in the directory.

That's that.  The one remaining nit is that hsmd is going to want to
know about files (including directories) that get deleted, so when the
link count of a file goes to 0, hsmd gets a message from the kernel with
a copy of the inode data, so it can update its view of the world (most
things will key off the inode number, to get the "real" offline file
identifier).

That's the whole scheme. Very little kernel support needed -- all the
real work happens in userland, which makes it easier to debug and adapt
to different purposes or types of offline storage. The hardest part is
the management of the offline storage, which could vary quite a bit
between different environments: NASA might back the disk with tape,
while in an environment with large numbers of more normal-usage users,
you might back the disk with CD jukeboxes and clean less often (burning
changed data onto CDs). That's one nice thing about doing all the hard
stuff in userland: it makes handling such diverse schemes a lot easier.

This scheme:

+ Does not require an inode on the local disk for every file in the
  entire hsm volume -- only for the ones currently on local disk and
  the fake files used to trigger swap-ins.

+ It isn't necessary for the entire directory tree of the HSM volume to
  be represented on disk. Whole subtrees can be swapped out. Or, you
  can keep them all local -- the policy for that is implemented in
  userland.

+ If you want, hsmd can use the fake files to store metadata about the
  real files -- instead of just having them 0-length, store your data.
  Unveiled processes will be able to see and manipulate it to implement
  your scheme.

+ No modifications needed to fsck or dump.  In fact, this should work
  for any filesystem type that supports file flags. LFS, for instance.

- Entire files must be moved on or off local storage. I.e., it's a
  swapping rather than paging scheme. This is probably fine for most
  uses. If you need paging, you probably need to implement a whole new
  VFS type anyway and not layer it on top of UFS.

Comments?