Subject: Re: Program recovery using checkpointing
To: Jason Thorpe <thorpej@shagadelic.org>
From: Kamal R. Prasad <kamalpr@yahoo.com>
List: tech-kern
Date: 03/11/2005 07:05:45
--- Jason Thorpe <thorpej@shagadelic.org> wrote:
> 
> On Mar 11, 2005, at 6:24 AM, Kamal R. Prasad wrote:
> 
> > Ok -they are being ambitious in writing fds to disk
> > and recovering them. I am not that ambitious. I expect
> > the process that created a checkpoint to be still up
> > and running when a recovery is initiated. And it is
> > meant for program recovery -not process migration.
> 
> So how do you prevent restart when the process is no
> longer running?  

If you look at kern_checkpoint.c in the tarball, line
770 (and later under case CKPT_LONGJMP), I construct
the pathname from p->p_comm and p->p_pid. So, unless
the user has altered the checkpoint files, process B
won't be restarting process A at a saved state. Further,
kern_exit.c in the patch removes all checkpoint files
in exit1(). Basically, I don't want to give the user
any control over the checkpoint files [that would be a
security hole].
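
To make the naming idea concrete, here is a minimal userland
sketch. It is not the code from the patch. The directory
CKPT_DIR, the "comm.pid" file-name layout, and the helpers
ckpt_pathname() and ckpt_is_own() are all hypothetical; the
only part taken from the description above is that the name
is derived from the command name and the pid rather than
supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical directory and file-name layout, for illustration only. */
#define CKPT_DIR "/var/tmp/ckpt"

/*
 * Derive the checkpoint path for (comm, pid).
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
static int
ckpt_pathname(char *buf, size_t len, const char *comm, pid_t pid)
{
	int n;

	assert(buf != NULL && comm != NULL);
	n = snprintf(buf, len, "%s/%s.%ld", CKPT_DIR, comm, (long)pid);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Nonzero only if `path` is exactly the name derived for (comm, pid),
 * i.e. a file written for process A never matches process B.
 */
static int
ckpt_is_own(const char *path, const char *comm, pid_t pid)
{
	char own[256];

	if (ckpt_pathname(own, sizeof(own), comm, pid) != 0)
		return 0;
	return strcmp(path, own) == 0;
}
```

Since the restoring side recomputes the name from its own
identity, a checkpoint written for ("procA", 100) is never
picked up by ("procB", 200) unless someone renames files by
hand.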

> Or, more to the point, how do you validate those file
> handles before they are put back into use?
> 

If you look at imgact_elf.c lines 1181-1190, I skip the
call to elf_putfiles(), the routine that writes open fds
to the checkpoint file.

> If the process is still "running", and all of its file
> descriptors are thus still perfectly valid, then why are
> the file handles in the checkpoint file in the first place?
> 
They aren't, in my case. In DragonFly BSD they are, and
that is one point where I differ from them.

> What happens if a process opens a file between "last
> checkpoint" and 

When the process reverts to the last checkpoint, it
retains the newly opened file descriptor.

> "crash"?  What if it closes a file in between those
> operations?
> 
Then it is semantically correct for the file descriptor
to be closed. It's for the programmer to assume that the
code path doesn't necessarily start at main(), and could
be the result of a longjmp(), i.e. he is responsible for
reopening, at a restore point, any files he has closed
further down the road before reverting.
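
A userland analogy using plain setjmp()/longjmp(), not the
kernel checkpoint interface from the patch, shows the
pattern I have in mind. The names restore_pt, fp, entries
and run_with_rollback() are made up for this sketch. The
point is only that descriptor state is not rolled back, so
the code at the restore point checks and reopens.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

static jmp_buf restore_pt;	/* stands in for the saved checkpoint */
static FILE *fp;		/* file state is NOT rolled back by longjmp() */
static int entries;		/* times the restore point has been passed */

/*
 * Do some work, close the file, revert once to the restore point,
 * and return how many times the restore point was passed.
 * Returns -1 if the file cannot be opened.
 */
static int
run_with_rollback(const char *path)
{
	assert(path != NULL);
	entries = 0;
	fp = NULL;

	(void)setjmp(restore_pt);	/* restore point */
	entries++;

	/* We may have arrived here via longjmp(): do not assume fp is open. */
	if (fp == NULL) {
		fp = fopen(path, "a");
		if (fp == NULL)
			return -1;
	}
	fputs("work\n", fp);

	/* Further down the road the file is closed... */
	fclose(fp);
	fp = NULL;

	/* ...and then the program reverts to the restore point, once. */
	if (entries < 2)
		longjmp(restore_pt, 1);

	return entries;
}
```

The state that must survive the jump is kept in statics
here, because locals modified between setjmp() and
longjmp() are indeterminate unless declared volatile.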

> I read the "paper" you posted.  It seems more like a
> data sheet... 
I'm sorry, this is my first paper :-).

> there is a lot of "why" missing from it.  If you
> have answers for all 
> of these questions above, why didn't you address
> them in the paper?
> 
Some of these "why"s, assuming the discussion lasts,
will likely end up in the paper :-).

regards
-kamal

> -- thorpej
> 
> 

------------------------------------------------------------
Kamal R. Prasad
UNIX systems consultant 
kamalp@acm.org

In theory, there is no difference between theory and practice. In practice, there is:-).
------------------------------------------------------------
