A Checkpoint using Loose Synchronization in Distributed Systems

I.K. Takeuchi and Y. Nakayama (Japan)


Checkpointing, Fault Tolerance, Distributed Systems, Message Passing


Checkpoint protocols have been proposed for distributed systems in which the processes of a parallel program communicate by message passing. In this paper, we propose a new checkpoint protocol that uses loose synchronization instead of the strict synchronization used in some checkpoint-based systems. The loose synchronization of this protocol does not need to block execution of the parallel program while checkpointing. A system with this protocol needs to log only a minimal number of messages exchanged between processes during the loose synchronization period before and after checkpoints. A checkpoint system with this protocol gains benefits in terms of low overhead in failure-free execution, simplicity of the recovery algorithm and garbage collection, and ease of implementation. We implemented a simple checkpoint system with this protocol and evaluated its effectiveness in experiments.
