A Consistent Coordinated Checkpointing Protocol for Distributed Systems

S. Neogy; A. Sinha; P.K. Das

S. Neogy, A. Sinha, and P.K. Das (India)

Keywords

checkpointing, recovery, checkpoint initiator, consistency, tentative checkpoint, coordinated checkpointing

Abstract

This paper describes a distributed coordinated checkpointing protocol that always ensures a consistent set of checkpoints. A checkpoint initiator starts each round of checkpointing, and the protocol is two-phase: each process maintains a tentative checkpoint until it is either made permanent or aborted. There is no central checkpoint initiator; instead, the processes take turns acting as the initiator. Processes take local checkpoints only after being notified by the initiator. Processes communicate only via messages, and the guarantee that no message is lost in case of failure is maintained by forcing processes to refrain from sending computation messages for a certain period of time, which generally equals the time a message takes to travel through the network from sender to destination. During that period, processes carry out only local computation, which is eventually included in the current permanent checkpoint.
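The two-phase structure described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the names `Process`, `initiator_for`, and `run_round`, the `healthy` flag, the round-robin rotation rule, and the choice to have the initiator notify itself along with the others are all assumptions made for illustration. Message delays and the local computation performed during the quiet period are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    pid: int
    state: int = 0                    # stand-in for application state
    tentative: Optional[int] = None   # tentative checkpoint, if any
    permanent: Optional[int] = None   # last permanent checkpoint
    sends_blocked: bool = False       # True while computation messages are held back
    healthy: bool = True              # hypothetical flag: can this process checkpoint?

    def take_tentative(self) -> bool:
        """Phase 1: save a tentative checkpoint and stop sending computation messages."""
        if not self.healthy:
            return False
        self.tentative = self.state
        self.sends_blocked = True
        return True

    def commit(self) -> None:
        """Phase 2 (success): make the tentative checkpoint permanent."""
        self.permanent = self.tentative
        self.tentative = None
        self.sends_blocked = False

    def abort(self) -> None:
        """Phase 2 (failure): discard the tentative checkpoint."""
        self.tentative = None
        self.sends_blocked = False


def initiator_for(round_no: int, processes: List[Process]) -> Process:
    """Pick the initiator by round-robin (an assumed rotation rule)."""
    return processes[round_no % len(processes)]


def run_round(round_no: int, processes: List[Process]) -> bool:
    """Run one checkpointing round; return True if it committed."""
    initiator = initiator_for(round_no, processes)
    # Phase 1: the initiator notifies every process (itself first, by assumption).
    order = [initiator] + [p for p in processes if p is not initiator]
    replies = [p.take_tentative() for p in order]
    # Phase 2: commit only if every process holds a tentative checkpoint.
    committed = all(replies)
    for p in processes:
        if committed:
            p.commit()
        else:
            p.abort()
    return committed
```

In this sketch a round either makes every tentative checkpoint permanent or discards all of them, so the set of permanent checkpoints always comes from the same round, which is the all-or-nothing behaviour the abstract attributes to the tentative/permanent distinction.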

From Proceeding (439) Parallel and Distributed Computing and Systems - 2004