As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are a major source of such problems:
they corrupt the execution results of HPC applications without being detected.
In this work, we explore a set of novel SDC detectors that leverage epsilon-insensitive support vector machine regression to detect SDCs in HPC applications. The key contributions are threefold. (1) Our exploration takes temporal, spatial, and spatiotemporal features into account and analyzes different detectors based on different features. (2) We provide an in-depth study of detection ability and performance under different parameters, and we carefully optimize the detection range. (3) Experiments with eight real-world HPC applications show that support vector machine based detectors can achieve a detection sensitivity (i.e., recall) of up to 99% while incurring a false positive rate of less than 1% in most cases. Our detectors incur low performance overhead, 5% on average, for all benchmarks studied in this paper.
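The detection idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear model trained by subgradient descent on the epsilon-insensitive loss (the paper's detectors may use a different kernel and solver), temporal features only (the previous K values of one data point), a synthetic sine signal in place of application data, and a hypothetical calibration rule that sets the detection range to a multiple of the worst training residual. All function names and parameter values are illustrative.

```python
import math

def train_linear_svr(X, y, epsilon=0.01, lam=1e-4, lr=0.01, epochs=200):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularized epsilon-insensitive loss (linear, primal form)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            r = yi - (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Residuals inside the epsilon tube contribute no loss gradient.
            g = 0.0 if abs(r) <= epsilon else math.copysign(1.0, r)
            w = [wj + lr * (g * xj - lam * wj) for wj, xj in zip(w, xi)]
            b += lr * g
    return w, b

def predict(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def temporal_features(series, k):
    """Pair each value with the k values that precede it in time."""
    X = [series[t - k:t] for t in range(k, len(series))]
    y = series[k:]
    return X, y

def detection_radius(w, b, X, y, margin=2.0):
    """Assumed calibration rule: a multiple of the worst training residual."""
    worst = max(abs(yi - predict(w, b, xi)) for xi, yi in zip(X, y))
    return margin * worst

def is_suspect(w, b, radius, history, observed):
    """Flag a value that falls outside prediction +/- detection radius."""
    return abs(observed - predict(w, b, history)) > radius

# Smooth synthetic signal standing in for one data point's time series.
series = [math.sin(0.1 * t) for t in range(200)]
K = 3
X, y = temporal_features(series, K)
w, b = train_linear_svr(X, y)
radius = detection_radius(w, b, X, y)

history, clean = X[50], y[50]
corrupted = clean + 1000.0  # stand-in for a flip in a high-order bit
print("clean flagged:", is_suspect(w, b, radius, history, clean))
print("corrupted flagged:", is_suspect(w, b, radius, history, corrupted))
```

Widening the margin reduces false positives but lets smaller corruptions pass undetected, which is the sensitivity versus false positive rate trade-off that tuning the detection range addresses.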
Revised: August 13, 2019
Published: September 3, 2018
Citation
Subasi O., S. Di, L. Bautista-Gomez, P. Balaprakash, O. Unsal, J. Labarta, and A. Cristal, et al. 2018. Exploring The Capabilities of Support Vector Machines in Detecting Silent Data Corruptions. Sustainable Computing: Informatics and Systems 19. PNNL-SA-131767. doi:10.1016/j.suscom.2018.01.004