December 17, 2018
Conference Paper

Characterization of the Impact of Soft Errors on Iterative Methods

Abstract

Soft errors caused by transient bit flips have the potential to significantly impact an application’s behavior. This has motivated the design of an array of techniques to detect, isolate, and correct soft errors using microarchitectural, architectural, compilation-based, or application-level approaches that minimize their impact on the executing application. The first step towards the design of good error detection/correction techniques involves an understanding of an application’s vulnerability to soft errors. In this paper, we present the first comprehensive characterization of the impact of soft errors on the convergence characteristics of six iterative methods using application-level fault injection. In particular, we consider the use of iterative methods to incrementally solve a linear system of equations, which constitutes the core kernel in many scientific applications. We analyze the impact of soft errors in terms of the type of error (single- vs. multi-bit), the distribution and location of the bits affected, the data structure and the statement impacted, and the variation with time. In addition to improving the understanding of the vulnerability of iterative solvers to soft errors, this characterization can aid the design of fault injection campaigns that ensure systematic coverage.
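As an illustration of what application-level fault injection into an iterative solver can look like, the following is a minimal sketch, not the injector or the solvers used in the paper. It assumes a plain Jacobi iteration (the abstract does not name the six methods studied) and a single-bit flip in one entry of the solution vector at a chosen iteration. The names `flip_bit` and `jacobi`, and the `fault` tuple `(iteration, index, bit)`, are hypothetical and introduced only for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def jacobi(A, b, tol=1e-10, max_iter=500, fault=None):
    """Solve A x = b with Jacobi iteration, optionally injecting one fault.

    `fault` is a hypothetical (iteration, index, bit) tuple: at the start of
    that iteration, one bit of x[index] is flipped. Returns the final iterate
    and the number of iterations performed.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        if fault is not None and fault[0] == k:
            _, idx, bit = fault
            x[idx] = flip_bit(x[idx], bit)  # the injected soft error
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        diff = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if diff < tol:
            return x, k
    return x, max_iter


# Small diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

x_clean, its_clean = jacobi(A, b)
# Flip the sign bit of x[0] at the start of iteration 5.
x_faulty, its_faulty = jacobi(A, b, fault=(5, 0, 63))
print(its_clean, its_faulty)
```

Varying the `bit`, `index`, and `iteration` fields of `fault` corresponds loosely to the dimensions named in the abstract: bit location, data structure affected, and variation with time. Flips in high exponent bits can produce very large values, infinities, or NaNs, so a real campaign would also need to record runs that diverge or fail to converge.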

Revised: May 15, 2019 | Published: December 17, 2018

Citation

Mutlu B., G.G. Kestor, J.B. Manzano Franco, O. Unsal, S. Chatterjee, and S. Krishnamoorthy. 2018. Characterization of the Impact of Soft Errors on Iterative Methods. In 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2018), December 17-20, 2018, Bengaluru, India, 203-214. Los Alamitos, California: IEEE Computer Society. PNNL-SA-138072. doi:10.1109/HiPC.2018.00031