RAS and Job Log Data Analysis for Failure Prediction for the IBM Blue Gene/L

L.D. Solano-Quinde and B.M. Bode (USA)

Keywords

Fault Tolerance, IBM Blue Gene/L, Large Scale Systems

Abstract

The computational needs of scientific applications have grown to the point where computers with a very high degree of parallelism are required. The IBM Blue Gene/L can hold in excess of 200K processors and has been designed for high performance. However, failures in a system of this size are a major concern, since it has been demonstrated that a failure drastically decreases system performance. Checkpointing and logging schemes have been used to recover from these failures, but these techniques have been shown to be less effective than desired. Proactive failure detection and prediction have therefore gained interest in the research community. In this study, we collected the RAS event and job logs from a large IBM Blue Gene/L over a three-month period and investigated the relationship between fatal and non-fatal events with the aim of proactive failure prediction. Based on our observations, we developed a scheme for predicting fatal events from the spatial and temporal relations between fatal and non-fatal events. We show that this scheme could effectively predict up to 84% of fatal events.
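As a rough illustration of what relating fatal and non-fatal events in space and time can mean, the sketch below counts a fatal event as predictable when a non-fatal event was logged at the same location within a preceding time window. This is not the scheme developed in the paper, whose details are not given in the abstract. The `Event` record, the location labels, the window length, and the `predicted_fraction` function are all hypothetical, and the sample events are made up for the example rather than taken from any log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified RAS event record (not the real log schema)."""
    time: float      # seconds since the start of the log (assumed unit)
    location: str    # hardware location label (assumed format)
    fatal: bool      # True for a fatal event, False for a non-fatal one


def predicted_fraction(events, window):
    """Return the fraction of fatal events that were preceded by a
    non-fatal event at the same location within `window` seconds."""
    last_nonfatal = {}   # location -> time of the most recent non-fatal event
    fatal_total = 0
    fatal_predicted = 0
    for ev in sorted(events, key=lambda e: e.time):
        if ev.fatal:
            fatal_total += 1
            t = last_nonfatal.get(ev.location)
            if t is not None and ev.time - t <= window:
                fatal_predicted += 1
        else:
            last_nonfatal[ev.location] = ev.time
    return fatal_predicted / fatal_total if fatal_total else 0.0


# Made-up events, for illustration only.
events = [
    Event(0.0, "R00-M0", False),
    Event(30.0, "R00-M0", True),    # precursor 30 s earlier at the same location
    Event(40.0, "R01-M1", True),    # no precursor at this location
    Event(100.0, "R01-M1", False),
    Event(500.0, "R01-M1", True),   # precursor 400 s earlier, outside a 60 s window
]
fraction = predicted_fraction(events, window=60.0)
```

With a 60-second window, only the first of the three fatal events has a qualifying precursor. Widening the window raises the fraction, but each warning is then issued with less certainty about when the failure will occur.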
