tech-kern: Re: Program recovery using checkpointing

Subject: Re: Program recovery using checkpointing
To: None <kamalp@acm.org>
From: Jason Thorpe <thorpej@shagadelic.org>
List: tech-kern
Date: 03/11/2005 06:53:34

On Mar 11, 2005, at 6:24 AM, Kamal R. Prasad wrote:

> Ok -they are being ambitious in writing fds to disk
> and recovering them. I am not that ambitious. I expect
> the process that created a checkpoint to be still up
> and running when a recovery is initiated. And it is
> meant for program recovery -not process migration.

So how do you prevent restart when the process is no longer running?  
Or, more to the point, how do you validate those file handles before 
they are put back into use?
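
To make the validation question concrete, here is a minimal sketch in C of
one possible check. It is not taken from the paper. The struct ckpt_fd and
both ckpt_fd_* functions are hypothetical names invented for illustration,
and the sketch assumes that matching device, inode and access mode (via
fstat() and fcntl()) is an acceptable notion of "same file":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* Hypothetical record written to the checkpoint for one descriptor. */
struct ckpt_fd {
    int   fd;       /* descriptor number at checkpoint time */
    dev_t dev;      /* device holding the open file */
    ino_t ino;      /* inode number of the open file */
    int   accmode;  /* O_RDONLY, O_WRONLY or O_RDWR */
};

/* Capture the identity of an open descriptor. Returns 0, or -1 on error. */
int ckpt_fd_record(int fd, struct ckpt_fd *rec)
{
    struct stat st;
    int flags;

    assert(rec != NULL);
    if (fstat(fd, &st) == -1)
        return -1;
    if ((flags = fcntl(fd, F_GETFL)) == -1)
        return -1;
    rec->fd = fd;
    rec->dev = st.st_dev;
    rec->ino = st.st_ino;
    rec->accmode = flags & O_ACCMODE;
    return 0;
}

/* True only if the descriptor is open and still names the recorded file. */
bool ckpt_fd_still_valid(const struct ckpt_fd *rec)
{
    struct stat st;
    int flags;

    assert(rec != NULL);
    if (fstat(rec->fd, &st) == -1)
        return false;               /* e.g. EBADF: descriptor was closed */
    if (st.st_dev != rec->dev || st.st_ino != rec->ino)
        return false;               /* number was reused for another file */
    flags = fcntl(rec->fd, F_GETFL);
    return flags != -1 && (flags & O_ACCMODE) == rec->accmode;
}
```

Even under those assumptions this says nothing about the file offset,
locks, or the state of sockets and pipes, which is part of the question.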

If the process is still "running", and all of its file descriptors are 
thus still perfectly valid, then why are the file handles in the 
checkpoint file in the first place?

What happens if a process opens a file between "last checkpoint" and 
"crash"?  What if it closes a file in between those operations?
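
As a rough illustration of that drift (again a sketch, not anything from
the paper), the C below snapshots which descriptor numbers are open and
counts the differences between two snapshots. CKPT_MAX_FD, struct
ckpt_fdset and the ckpt_fdset_* functions are hypothetical names, and the
fixed bound of 64 descriptors is an assumption made to keep it short:

```c
#include <assert.h>
#include <stddef.h>
#include <fcntl.h>

#define CKPT_MAX_FD 64  /* hypothetical bound on descriptors examined */

/* One flag per descriptor number: nonzero if it was open at snapshot time. */
struct ckpt_fdset {
    unsigned char open[CKPT_MAX_FD];
};

/* Record which descriptor numbers are currently open in this process. */
void ckpt_fdset_snapshot(struct ckpt_fdset *s)
{
    assert(s != NULL);
    for (int fd = 0; fd < CKPT_MAX_FD; fd++)
        s->open[fd] = (fcntl(fd, F_GETFD) != -1);  /* -1 means not open */
}

/* Count descriptors opened and closed between two snapshots. */
void ckpt_fdset_drift(const struct ckpt_fdset *saved,
                      const struct ckpt_fdset *now,
                      int *opened, int *closed)
{
    assert(saved != NULL && now != NULL);
    assert(opened != NULL && closed != NULL);
    *opened = 0;
    *closed = 0;
    for (int fd = 0; fd < CKPT_MAX_FD; fd++) {
        if (!saved->open[fd] && now->open[fd])
            (*opened)++;        /* opened after the checkpoint */
        else if (saved->open[fd] && !now->open[fd])
            (*closed)++;        /* closed after the checkpoint */
    }
}
```

Note that a descriptor closed and then reopened on a different file under
the same number shows no drift here at all, so a check like this would
still need a per-descriptor identity comparison (for example on fstat()
results) to be of any use.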

I read the "paper" you posted.  It seems more like a data sheet... 
there is a lot of "why" missing from it.  If you have answers for all 
of the questions above, why didn't you address them in the paper?

-- thorpej