A Proactive Fault Tolerance Framework for High-Performance Computing

A. Litvinova (UK), C. Engelmann, and S.L. Scott (USA)

Keywords

high-performance computing, fault tolerance, system monitoring, high availability, reliability

Abstract

As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that do not increase proportionally, checkpoint/restart is becoming less efficient. Proactive FT avoids failures through preventative measures, such as migrating application parts away from nodes that are “about to fail”. This paper presents a proactive FT framework that performs environmental monitoring, event logging, parallel job monitoring, and resource monitoring to analyze HPC system reliability and to perform FT through such preventative actions.
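
The preventative step described in the abstract can be illustrated with a minimal sketch. It is not the framework's implementation: the `NodeReading` fields, the threshold values, and the round-robin choice of target nodes are all assumptions made for illustration. The sketch flags nodes whose environmental readings cross a threshold and plans to move the application parts running on them to healthy nodes.

```python
from dataclasses import dataclass


@dataclass
class NodeReading:
    """One environmental sample for a compute node (hypothetical fields)."""
    node: str
    temperature_c: float
    fan_rpm: int


# Hypothetical thresholds; real limits depend on the hardware being monitored.
TEMP_LIMIT_C = 85.0
FAN_MIN_RPM = 1000


def is_at_risk(reading: NodeReading) -> bool:
    """Treat a node as "about to fail" if it runs too hot or a fan is too slow."""
    return reading.temperature_c >= TEMP_LIMIT_C or reading.fan_rpm < FAN_MIN_RPM


def plan_migrations(readings, placement):
    """Map each process on an at-risk node to a healthy node, round-robin.

    readings:  list of NodeReading, one per node
    placement: dict of process name -> node it currently runs on
    Returns a dict of process name -> target node (empty if no healthy node exists).
    """
    at_risk = {r.node for r in readings if is_at_risk(r)}
    healthy = sorted(r.node for r in readings if r.node not in at_risk)
    plan = {}
    if not healthy:
        return plan
    next_target = 0
    for process, node in sorted(placement.items()):
        if node in at_risk:
            plan[process] = healthy[next_target % len(healthy)]
            next_target += 1
    return plan
```

For example, with three nodes of which only `node1` exceeds the temperature limit, `plan_migrations` returns a plan that moves only the processes placed on `node1` and spreads them across the two remaining nodes. A real deployment would also need to account for the capacity of the target nodes and the cost of each migration, which this sketch ignores.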
