H.S. Paul, A. Gupta, and R. Badrinath (India)
Checkpoint and Rollback Recovery, Fault-tolerant Systems, Distributed Algorithm.
The coordinated checkpointing protocol is a simple and useful protocol for fault tolerance in distributed systems on a LAN. However, its checkpoint overhead is bottlenecked by link speed: the overhead increases even if only one link in the network is slow. In a metacomputing environment, where a distributed application communicates over a low-speed WAN, the checkpoint overhead becomes very large. In this paper we present a hierarchical coordinated checkpointing protocol that aims to overcome this network speed bottleneck. The protocol is based on the two-phase commit protocol and is suited to an Internet-like network topology, in which the computers within a cluster are connected by high-speed links and the clusters themselves are connected by low-speed links. Metacomputing environments run over similar networks. We present simulation studies of the protocol, which show an improvement in checkpoint overhead over the well-known coordinated checkpointing protocol.
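To illustrate why a hierarchy can relieve the slow-link bottleneck, the following is a minimal sketch of a message-count model, not the protocol as specified in the paper. It assumes three control messages per participant per checkpoint round (request, reply, decision), and it assumes that in the hierarchical variant the global coordinator contacts only one leader process per remote cluster, which then relays within its cluster over the LAN. The names `Cluster`, `flat_wan_messages`, and `hierarchical_wan_messages` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Cluster:
    name: str
    processes: int  # number of application processes in the cluster


# Assumption: one two-phase round costs 3 control messages per participant
# (checkpoint request, reply, final decision).
MESSAGES_PER_PARTICIPANT = 3


def flat_wan_messages(clusters: Sequence[Cluster], coordinator_cluster: int = 0) -> int:
    """Slow-link messages when the coordinator contacts every process directly."""
    remote_processes = sum(
        c.processes for i, c in enumerate(clusters) if i != coordinator_cluster
    )
    return MESSAGES_PER_PARTICIPANT * remote_processes


def hierarchical_wan_messages(clusters: Sequence[Cluster]) -> int:
    """Slow-link messages when the coordinator contacts one leader per remote cluster.

    Assumption: each leader fans the round out to its own cluster over fast
    LAN links, so only coordinator-to-leader traffic crosses the slow links.
    """
    remote_clusters = len(clusters) - 1
    return MESSAGES_PER_PARTICIPANT * remote_clusters


if __name__ == "__main__":
    topology = [Cluster("A", 4), Cluster("B", 4), Cluster("C", 4)]
    print("flat:", flat_wan_messages(topology))
    print("hierarchical:", hierarchical_wan_messages(topology))
```

Under these assumptions, the number of messages crossing slow links grows with the number of remote clusters rather than with the number of remote processes. The model counts messages only; it says nothing about checkpoint size, latency, or the simulation results reported in the paper.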