A. Hattori, K. Yakusiji, T. Yokota, K. Ootsu, F. Furukama, and T. Baba (Japan)
Computational grids, Fault-tolerance, Message logging, Checkpointing, Recovery, Migration
Generally, a long-running application on a huge parallel and distributed computer system is at risk because the failure rate of such a system is higher. Fault-tolerance technologies are therefore required to build a reliable computational grid system in practice. In this paper, we first propose Eagle, a novel fault-tolerant system for computational grids. Eagle can tolerate simultaneous process failures. Furthermore, Eagle enables all processes in a domain to migrate to another domain. Second, we evaluate both the basic communication performance and the practical overheads of MPICH-EG, an implementation of Eagle for MPI, using a microbenchmark and the NAS Parallel Benchmarks (NPB). We also discuss checkpointing methods and evaluate their overheads using a checkpointer called ckpt.