I.K. Takeuchi and Y. Nakayama (Japan)
Checkpointing, Fault Tolerance, Distributed Systems, Message Passing
Checkpoint protocols have been proposed for distributed systems in which the processes of a parallel program communicate by message passing. In this paper, we propose a new checkpoint protocol that uses loose synchronization instead of the strict synchronization used in some checkpoint-based systems. The loose synchronization of this protocol does not need to block execution of the parallel program during checkpointing. A system using this protocol only needs to log a minimal number of messages, namely those exchanged between processes during the loose synchronization period before and after checkpoints. A checkpoint system with this protocol gains benefits in terms of low overhead in failure-free execution, simplicity of the recovery algorithm and garbage collection, and ease of implementation. We implemented a simple checkpoint system with this protocol and evaluated its effectiveness in experiments.
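The idea of logging messages only around a checkpoint, without blocking the program, can be illustrated with a toy model. The sketch below is not the protocol from the paper, whose details the abstract does not give. It only assumes that each process opens a logging window some time before its checkpoint and closes it some time after, and that messages delivered inside the window are logged while all others are not. The names Process, open_window, take_checkpoint, close_window, receive and run_demo are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Process:
    """Toy process that logs messages only inside a logging window (assumed model)."""
    pid: int
    state: int = 0                      # stand-in for the application state
    in_window: bool = False             # True while the loose synchronization period is open
    checkpoint: Optional[int] = None    # saved copy of the state, if any
    log: List[Tuple[int, int]] = field(default_factory=list)

    def open_window(self) -> None:
        # Start of the assumed loose synchronization period; execution continues.
        self.in_window = True

    def take_checkpoint(self) -> None:
        # Save the state without stopping message delivery.
        self.checkpoint = self.state

    def close_window(self) -> None:
        # End of the assumed loose synchronization period; logging stops.
        self.in_window = False

    def receive(self, sender: int, payload: int) -> None:
        # Messages are always delivered; they are logged only inside the window.
        if self.in_window:
            self.log.append((sender, payload))
        self.state += payload


def run_demo() -> Process:
    p = Process(pid=0)
    p.receive(1, 5)       # before the window: delivered, not logged
    p.open_window()
    p.receive(1, 2)       # inside the window, before the checkpoint: logged
    p.take_checkpoint()
    p.receive(2, 3)       # inside the window, after the checkpoint: logged
    p.close_window()
    p.receive(1, 4)       # after the window: delivered, not logged
    return p


if __name__ == "__main__":
    p = run_demo()
    print(p.checkpoint, p.log, p.state)
```

In this toy run, only the two messages delivered inside the window are kept in the log, and the process never stops receiving while the checkpoint is taken. How the real protocol chooses the window boundaries and uses the log during recovery is left open here.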