Subject: RE: Building an alternate backing store.
To: 'Jonathan Stone' <jonathan@DSG.Stanford.EDU>
From: Andrew Sporner <andy.sporner@networkengines.com>
List: tech-kern
Date: 07/14/2000 15:17:48
> 
> What kinds of answers are you hoping for?
> 

None now, but there was a time when I would post a question and
get no responses for a long while, or at all.  My guess is that I
was not touching anyone's sweet spot of interest, as I obviously
have today with the DSM topic.  I know how I read some of these
lists myself: by skimming the subject heading and deciding whether
it is interesting enough.  I only caught this thread by accident,
and I am glad I did.

> There is a lot of prior art in this area, though not specifically with
> NetBSD.  Some of the tradeoffs and design choices have been fairly
> well explored.
> 
> Answers suitable for someone familiar with, say, the Sprite papers
> and some of the implementation (or even comparison with other
> contemporary distributed-system projects) won't be very useful
> if you're not familiar with that work.
> 

I agree; truthfully, not being in the academic world, I don't have
as much time to research other projects as I would like.  I only
know about them vaguely.

> 
> From that viewpoint, "Migrating processes using sockets" (paraphrase)
> isn't a completely clear description: if you migrate a process twice,
> does it depend on socketstate kept on both previous systems, or only
> on the original host system?  What crash-recovery operations are
> needed after failure, and how long do they take?

Sorry about the lack of clarity.  The original thought was to move a
process from one node to another for load-balancing reasons.  There
are some problems with this because of socket state, IPC resources
and the filesystem.  The last case is easy enough for now with
something like NFS.  The first two are not as easy.  I solved the
first one using a network address translator living on a gateway.
All network traffic from the outside goes through the gateway.  The
idea is to maintain an entry in the NAT as sockets are opened,
re-hosted or closed, so that the packets associated with a socket
can always find it within the cluster.  Shared memory will be
handled by extending the paging scheme.  As it turns out there is
somebody with the same notion, so my project will build on his.
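
Just to make the socket idea concrete, here is a rough sketch of
what a translation entry on the gateway might look like.  None of
these names exist anywhere yet; the struct and cnat_rehost() are
purely illustrative of the open/re-host/close bookkeeping described
above.

#include <sys/types.h>
#include <netinet/in.h>

/* One cluster-visible socket: external address vs. current owner. */
struct cnat_entry {
	struct in_addr	ce_ext_addr;	/* address the clients see      */
	u_int16_t	ce_ext_port;	/* external port                */
	struct in_addr	ce_node_addr;	/* node currently owning socket */
	u_int16_t	ce_node_port;	/* port on that node            */
};

/*
 * When a socket is re-hosted, the owning kernel tells the gateway,
 * which rewrites the entry so later packets still find the socket.
 */
int
cnat_rehost(struct cnat_entry *ce, struct in_addr new_node,
    u_int16_t new_port)
{
	ce->ce_node_addr = new_node;
	ce->ce_node_port = new_port;
	return (0);
}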

The idea of process movement is that all kernels in the cluster are
more or less equal peers.  When one kernel runs out of resources it
can elect to send a process to another kernel that is in better
shape resource-wise.
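
To illustrate the election, a trivial (and entirely made-up) helper
could just pick the live peer advertising the most free pages; the
real policy would obviously be more involved.

/* Invented per-node bookkeeping, for illustration only. */
struct cluster_node {
	int	cn_id;		/* node number               */
	long	cn_free_pages;	/* free memory it advertises */
	int	cn_up;		/* heard from recently?      */
};

static int
pick_target(struct cluster_node *nodes, int nnodes, int self)
{
	int i, best = -1;

	for (i = 0; i < nnodes; i++) {
		if (i == self || !nodes[i].cn_up)
			continue;
		if (best == -1 ||
		    nodes[i].cn_free_pages > nodes[best].cn_free_pages)
			best = i;
	}
	return (best);	/* -1 means nobody is in better shape */
}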

Process IDs in the cluster are created using a modulus principle
(this only applies to process IDs > 100 (userland stuff), which are
the only ones capable of relocation).  So, for instance, the process
IDs on node 1 all have a modulus of 1 with respect to the number of
nodes in the cluster.  In this way there are no PID conflicts in the
cluster.  The creating node does the normal fork/exec to instantiate
the process, but when it is moved, the original process table entry
remains with a new state and a new one is created on the node
accepting the process.  The new node gets only the minimal amount of
process information and "demand pages" the process into its own
memory space.  Only after the "essence" of the process is loaded on
the accepting node is the original process table entry's state
changed.
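
The modulus rule itself is simple enough to sketch.  The function
below is only my shorthand for it (the names, the 100 cutoff and the
wrap limit are placeholders, not real kernel code): node "node_id"
only ever hands out PIDs p > 100 with p % nnodes == node_id.

#include <sys/types.h>

#define	PID_RESERVED	100	/* at or below: system stuff, never moved */
#define	PID_WRAP	30000	/* arbitrary wrap point for the sketch    */

static pid_t
cluster_next_pid(pid_t last, int node_id, int nnodes)
{
	pid_t p = last + nnodes;	/* stay in our residue class */

	if (p > PID_WRAP) {
		p = PID_RESERVED + 1;
		while (p % nnodes != node_id)	/* back into our class */
			p++;
	}
	return (p);
}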

Subsequent moves are handled by notifying the process's "mother
node" of the next move, but a node giving up the process (one that
is not the mother node) does not keep the process entry.  The only
reason the mother node keeps its entry is so that the process ID
stays allocated and duplicates cannot occur.
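
In other words, only the mother node keeps a stub for a migrated
process.  Something like the following (again just illustrative; the
names are mine) is all it would need to remember:

#include <sys/types.h>

/* Stub the mother node keeps for a process that has moved away. */
struct migrated_proc {
	pid_t	mp_pid;		/* stays allocated, so no duplicates  */
	int	mp_cur_node;	/* node the process is running on now */
};

/* Called on the mother node each time the process moves again. */
static void
mother_note_move(struct migrated_proc *mp, int new_node)
{
	mp->mp_cur_node = new_node;
}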

> 
> Similarly, re DSM ("processes attached to shared memory"), what's
> your design target?  Release consistency, or something stronger?
> 

The target is an environment with the goals of NUMA (extending the
scalability of SMP) but without the OS single point of failure.  In
this environment the admitted worst-case scenario is to lose a piece
of shared memory that lived on a node which has died and that
processes cluster-wide were attached to.  The advantage is that the
O/S does not need to be rebooted, only the application, because such
a failure would cause a core-dumping action on all processes
cluster-wide that had memory pages on the dead node.

> Also, what's a "low latency interconnect"? Gigabit ethernet,
> or something like Myrinet?

TBD.  I suspect whatever is available at the time.