Analysis of Software-based Recovery Schemes for SMT Processors

L. Beyer, B. Fechner, and J. Keller (Germany)


Virtual Duplex System, SMT, recovery, roll-forward, roll back


Today's microprocessors are prone to transient hardware faults caused, for example, by ionizing particles. The usual method to detect and correct such faults is to use duplex systems in software. Fault detection and correction can be accelerated by taking advantage of the logical processors available since the introduction of commercial SMT systems, e.g. by performing a simultaneous retry and roll-forward on different logical processors. We derive four different recovery schemes ({probabilistic, deterministic} × {pessimistic, optimistic}), each of which can be applied after an error has been detected. The recovery software is modular and requires only minor extensions to existing code to provide protection. The schemes are tailored to be executed on an SMT processor. Their execution times are measured under the influence of transient faults, injected at rates of 10^-5, 1/4 · 10^-5 and 10^-6. Depending on the fault rate, the checkpoint distance and the probability p of correctly guessing the correct version, we make recommendations about which variant to choose. An important insight is that a high rate of successful guesses p is needed for the probabilistic schemes to provide a significant advantage over the deterministic ones. When the version to roll forward is chosen at random (p = 0.5), the optimistic deterministic variant is faster than the optimistic probabilistic one. With p = 0.7, the optimistic probabilistic variant begins to perform better than its deterministic counterpart. The comparison of the pessimistic schemes yields similar results.
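As a rough illustration of why fault rate and checkpoint distance interact, the following minimal sketch estimates how often a checkpoint interval is hit by at least one fault. It is not taken from the paper: it assumes that the fault rate is a per-step probability and that faults are independent, and the function name and the checkpoint distance of 100,000 steps are hypothetical.

```python
def p_fault_in_interval(rate: float, distance: int) -> float:
    """Probability that at least one fault occurs within one checkpoint interval.

    Assumption (not from the paper): `rate` is the probability of a fault
    per executed step, faults are independent, and `distance` is the number
    of steps between two checkpoints.
    """
    return 1.0 - (1.0 - rate) ** distance


if __name__ == "__main__":
    # Hypothetical checkpoint distance, chosen only for illustration.
    distance = 100_000
    for rate in (1e-5, 1e-6):
        p_hit = p_fault_in_interval(rate, distance)
        print(f"rate={rate:g}: P(at least one fault per interval) = {p_hit:.3f}")
```

Under these assumptions, the higher the fault rate or the longer the checkpoint distance, the more often a recovery scheme has to run, so the cost of a wrong guess in a probabilistic scheme is paid more frequently.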
