tech-kern: Re: Program recovery using checkpointing

Subject: Re: Program recovery using checkpointing
To: None <tech-kern@netbsd.org>
From: Christos Zoulas <christos@tac.gw.com>
List: tech-kern
Date: 03/11/2005 16:18:19

In article <200503111933.j2BJXA123034@srapc342.sra.co.jp>,
SODA Noriyuki  <soda@sra.co.jp> wrote:
>>>>>> On Fri, 11 Mar 2005 08:13:43 -0800 (PST),
>	"Kamal R. Prasad" <kamalpr@yahoo.com> said:
>
>>> IMHO, this work focuses on very limited application,
>
>> Its application is program recovery 
>
>But existing checkpointing systems already provide program recovery
>without the feature that the patch provides.
>
>> -but the types of userland applications (if that is what you meant)
>> which can use this feature is not limited.
>
>My point is that I don't think that the feature is useful.
>
>> If you want to use checkpointing for process
>> migration, that would require a substantial amt. of
>> work on checkpointing the kernel side of the process
>> aka the file descriptors, sockets, pipes etc..
>
>Actually, there are already some checkpointing systems which do
>provide process migration of processes which do network communication
>(e.g. MPI).  And such systems don't need the kernel support like what
>you are expecting.
>If you have any interest how they implement it, please read the source
>code of the following free software, for example:
>http://www.pccluster.org/score/dist/index.php

I have to agree with Soda here. I think the class of failures the
implemented checkpointing system addresses is very limited, and a
more general checkpointing mechanism would be much more useful. I.e. I
would like to be able to recover after a crash, or after a network
outage. Programs that core-dump can rarely be trusted to
work reliably after they have core-dumped once. I would even
argue that programs that core-dump should be fixed, and the use of
checkpointing in this case only serves as a band-aid which prevents
the wound from healing.

christos