Blue Waters User Portal | Science Teams

Detecting Silent Data Corruptions in Exascale Applications

Marc Snir, University of Illinois at Urbana-Champaign


Marc Snir, Chen Wang, Jinghan Sun, Omri Mor, Shibi He

As we move toward exascale platforms, silent soft hardware errors are likely to become more frequent, for several reasons: a larger number of components, the use of cheaper hardware that has less error-detection logic, and the use of sub-threshold logic to reduce energy consumption.

This has motivated a significant amount of research on Algorithm-Based Fault Tolerance (ABFT), where algorithm-specific methods are used to detect errors in intermediate data and repair them. The main disadvantage of these methods is that they are algorithm-specific and require knowledge of the properties of the numerical method used.

Error detection is the critical aspect of ABFT. Attempts have been made to use generic error detectors that would work for a large family of algorithms. These detectors ignore flips of less significant bits that will not lead to an incorrect answer and focus on detecting large errors. They can detect such errors with high precision and recall, but only if run immediately after an error has been injected: the error manifests itself as a large gap between a point value and the values of its neighbors, or between the current and previous values at that point. The need to run the detector at each iteration significantly reduces its usefulness.
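The kind of generic detector described above can be sketched as follows. This is only an illustration under stated assumptions, not the detector used in any cited work: the function name flag_large_gaps, the 2D grid stored as a list of lists, the use of the median of the four nearest neighbors as the spatial reference, and the two tolerances are all hypothetical choices.

```python
import math
from statistics import median


def flag_large_gaps(curr, prev, spatial_tol, temporal_tol):
    """Return interior points (i, j) whose value shows a large gap.

    A point is flagged if it is not finite, if it differs from the median
    of its four neighbors by more than spatial_tol, or if it differs from
    its value in the previous iterate by more than temporal_tol.
    (Hypothetical helper; thresholds are assumptions, not tuned values.)
    """
    flagged = []
    rows, cols = len(curr), len(curr[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            value = curr[i][j]
            if not math.isfinite(value):
                flagged.append((i, j))
                continue
            neighbors = [curr[i - 1][j], curr[i + 1][j],
                         curr[i][j - 1], curr[i][j + 1]]
            # The median keeps one corrupted neighbor from shifting the reference.
            spatial_gap = abs(value - median(neighbors))
            temporal_gap = abs(value - prev[i][j])
            if spatial_gap > spatial_tol or temporal_gap > temporal_tol:
                flagged.append((i, j))
    return flagged
```

Such a check only works while the corrupted value still stands out; after a few smoothing iterations the gap falls below any reasonable threshold, which is the limitation noted above.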

Our research focuses on iterative methods. With such methods, errors of small magnitude will usually have a tolerable impact on the final result, but errors of large magnitude (e.g., a bit flip in the exponent of a floating-point number) may cause the algorithm not to converge or to converge to a wrong answer.
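To make the difference between small and large errors concrete, the sketch below flips a single bit of an IEEE 754 double-precision value. The helper name flip_bit is hypothetical; the bit layout (bits 0 to 51 mantissa, bits 52 to 62 exponent, bit 63 sign) is that of the binary64 format.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of an IEEE 754 binary64 value."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


# Lowest mantissa bit: the value changes by one unit in the last place.
print(flip_bit(1.0, 0))   # 1.0000000000000002
# Lowest exponent bit: the value is halved.
print(flip_bit(1.0, 52))  # 0.5
# Highest exponent bit: the value becomes infinite.
print(flip_bit(1.0, 62))  # inf
```

A flip in a low mantissa bit is well within the tolerance of most iterative solvers, whereas a flip in a high exponent bit changes the value by many orders of magnitude.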

We are interested in developing detectors that can detect errors multiple iterations after they were injected. While the successive iterations will have smoothed the initial error, we hypothesize that a detector can detect the spatial pattern created by the error propagation. We shall attempt to build such a detector by training a Convolutional Neural Network (CNN). The CNN will be specific to the iterative method used, but independent of initial conditions and mesh size. While such a detector would be specific to one algorithm, it could be created in a mostly automated fashion: the training set is created by running the algorithm with multiple initial conditions and injecting errors at random locations after a random number of iterations. Furthermore, for most iterative methods, errors propagate locally from a point to its neighbors, so the detection algorithm can be local, too.
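The training-set construction described above might look roughly like the following sketch. Everything specific here is an assumption made for illustration: a Jacobi sweep for the 2D Laplace equation stands in for the iterative method, the error is modeled as an additive perturbation of fixed magnitude, and the names jacobi_step, make_sample, and make_dataset are hypothetical. The CNN itself is not shown; the labeled grids would be its input.

```python
import random


def jacobi_step(grid):
    """One Jacobi sweep for the 2D Laplace equation; boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def make_sample(size, total_iters, rng, inject, magnitude=1.0e3):
    """Run the solver from random initial data, optionally injecting one error.

    Returns (final_grid, label, location), where label is 1 if an error was
    injected and location is the injection point (None for clean runs).
    """
    grid = [[rng.random() for _ in range(size)] for _ in range(size)]
    inject_at = rng.randrange(total_iters) if inject else None
    loc = (rng.randrange(1, size - 1), rng.randrange(1, size - 1)) if inject else None
    for it in range(total_iters):
        if it == inject_at:
            # Assumed error model: add a large value at one interior point.
            grid[loc[0]][loc[1]] += magnitude
        grid = jacobi_step(grid)
    return grid, int(inject), loc


def make_dataset(n, size=16, total_iters=10, seed=0):
    """Alternate clean and faulty runs to obtain a balanced labeled set."""
    rng = random.Random(seed)
    return [make_sample(size, total_iters, rng, inject=(k % 2 == 1))
            for k in range(n)]
```

Because the error is injected before at least one further sweep, every faulty sample has already been smoothed by the solver, which is the situation the detector is meant to handle. Since propagation is local, small patches cropped around the recorded location could serve as training inputs instead of whole grids.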