Subject: Process checkpointing
To: None <tech-kern@NetBSD.ORG>
From: Dr. Lex Wennmacher <wennmach@geo.Uni-Koeln.DE>
List: tech-kern
Date: 01/26/1998 15:32:49
Occasionally, when I have to notify a user about a necessary shutdown, I find
myself looking into sad eyes.  "Duhh. Another 200 Teracycles lost."
Our users often run large number-crunching jobs, models in N dimensions.
Mostly, these jobs cannot be continued once interrupted (although some users
try to implement restartability in their code).

My proposal is to add checkpointing capabilities to NetBSD. I thought about
implementing checkpointing as a new system call 'chkpoint' (to my knowledge
IRIX 6.2 has implemented something similar; I don't know details, though). A
process designed to be restartable would then simply install a signal handler
which would invoke 'chkpoint' when SIGXCPU (or a similar signal) is delivered.
Maybe even a new signal could be added: SIGCHKP.
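
A minimal sketch of how a restartable process might use this, in C. It is only
an illustration under stated assumptions: 'chkpoint' does not exist in NetBSD,
so the function below is a hypothetical stub that merely counts calls, and the
names install_checkpoint_handler and run_job are made up for the example. The
handler only sets a flag and the compute loop does the actual checkpoint at a
step boundary, because it is not known whether the proposed call would be safe
to invoke directly from signal context.

```c
#define _XOPEN_SOURCE 700   /* expose sigaction() and SIGXCPU under strict -std */
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the signal handler, polled by the compute loop. */
static volatile sig_atomic_t checkpoint_requested = 0;

/* Number of checkpoints taken; only here to make the sketch observable. */
static int checkpoints_taken = 0;

/* HYPOTHETICAL stand-in for the proposed 'chkpoint' system call.
 * A real implementation would save the process image; this stub
 * only counts how often it was invoked. */
int chkpoint(void)
{
    checkpoints_taken++;
    return 0;
}

/* Signal handler: do the minimum possible, just record the request. */
void on_sigxcpu(int signo)
{
    (void)signo;
    checkpoint_requested = 1;
}

/* Install the SIGXCPU handler. Returns 0 on success, -1 on error. */
int install_checkpoint_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigxcpu;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGXCPU, &sa, NULL);
}

/* Toy number-crunching job: sums 0..steps-1 and checkpoints at a
 * step boundary whenever the handler has requested it. */
long run_job(long steps)
{
    long sum = 0;
    long i;

    assert(steps >= 0);
    for (i = 0; i < steps; i++) {
        sum += i;
        if (checkpoint_requested) {
            checkpoint_requested = 0;
            chkpoint();
        }
    }
    return sum;
}
```

A program would call install_checkpoint_handler() once at startup and then
run_job(); the operator (or the CPU limit) delivers SIGXCPU before shutdown.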

From discussions with Ignatios Souvatzis and Christoph Badura I am aware
of the inherent problems of checkpointing: open files (vnode -> filename
problem), initialization of devices (initialization history lost), shared
memory segments and so on.
Most of the checkpointing can probably be done in userland, similar to the
Condor batch system implementation or the 'save_world' routines by Bennet Yee.
(Hmm. No new syscall? Shouldn't I have submitted this proposal to tech-kern
in the first place?)
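
To make the userland idea concrete, here is a small C sketch of the simplest
variant: the program saves its own state to a file and resumes from it. This
is NOT how Condor or 'save_world' work (those save the process image itself);
it is the hand-written restartability some users already attempt. The struct
job_state and the functions save_state, load_state and run_steps are
hypothetical names invented for this example. The state is written to a
temporary file and then renamed, so an interrupted save does not leave a
truncated checkpoint; the sketch does not call fsync, so it makes no promise
about surviving a crash of the machine itself.

```c
#include <assert.h>
#include <stdio.h>

/* HYPOTHETICAL job state: whatever the computation needs to resume. */
struct job_state {
    long   next_step;    /* first step not yet completed */
    double accumulator;  /* partial result so far */
};

/* Save state to 'path' via "<path>.tmp" and rename().
 * Returns 0 on success, -1 on error. */
int save_state(const char *path, const struct job_state *st)
{
    char tmp[1024];
    FILE *fp;
    int n;

    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    if ((fp = fopen(tmp, "wb")) == NULL)
        return -1;
    if (fwrite(st, sizeof(*st), 1, fp) != 1) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load state from 'path'. Returns 0 on success, -1 if there is no
 * usable checkpoint (missing file or short read). */
int load_state(const char *path, struct job_state *st)
{
    FILE *fp = fopen(path, "rb");
    size_t n;

    if (fp == NULL)
        return -1;
    n = fread(st, sizeof(*st), 1, fp);
    fclose(fp);
    return n == 1 ? 0 : -1;
}

/* Run steps [st->next_step, end), saving a checkpoint every
 * 'interval' steps and once more at the end. */
int run_steps(const char *path, struct job_state *st, long end, long interval)
{
    assert(interval > 0);
    while (st->next_step < end) {
        st->accumulator += (double)st->next_step;
        st->next_step++;
        if (st->next_step % interval == 0 && save_state(path, st) != 0)
            return -1;
    }
    return save_state(path, st);
}
```

At startup the program would try load_state() and fall back to a zeroed
job_state if no checkpoint exists. Note that this sidesteps rather than solves
the problems listed above: open files and device state still have to be
re-established by the application itself.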

One additional advantage of checkpointing would be that processes could be
migrated from one system to another (Condor uses that).

I'm not a kernel hacker, and therefore I will stop here and leave further
discussions to the experts. I hope that enough of you find checkpointing
appealing, so that it will eventually be implemented.

Dr. Alexandre Wennmacher
Institut fuer Geophysik und Meteorologie         wennmach@geo.Uni-Koeln.DE
Universitaet zu Koeln                            phone  +49 221 470 - 3387
D-50923 Koeln                                    fax    +49 221 470 - 5198