Future HPC systems with ever-increasing resource
capacity (such as compute cores, memory, and storage) may
significantly increase the risk to reliability. Silent data corruptions
(SDCs), or silent errors, are among the major sources
of corrupted HPC execution results. Unlike fail-stop errors,
SDCs are particularly harmful because they cannot be
detected by hardware. We propose an online, machine-learning-based
silent data corruption detection framework (abbreviated
as MACORD) for detecting SDCs in HPC applications. In
particular, we comprehensively investigate the prediction ability
of a multitude of machine-learning algorithms in our study,
and enable the detector to automatically select the best-fit
algorithms at runtime to adapt to the data dynamics. Our
learning framework exhibits low memory overhead (less than
1%), since it uses only spatial features (i.e., neighboring data
values for each data point in the current time step) as
training data. Experiments based on real-world scientific
applications and benchmarks show that our framework achieves
detection sensitivity (i.e., recall) of up to 99% while the false
positive rate is limited to 0.1% in most cases, which is a
one-order-of-magnitude improvement over the latest state-of-the-art
spatial technique.
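
The idea summarized above (predict each data point from its spatial neighbors in the current time step, pick the best-fit predictor at runtime, and flag points whose observed value deviates from the prediction) can be illustrated with a minimal sketch. This is not the MACORD implementation: the three simple predictors stand in for the machine-learning algorithms studied in the paper, and the function names, the four-neighbor stencil, the median-residual selection rule, and the fixed `tolerance` are all assumptions made for illustration.

```python
import math
from statistics import median

# Hypothetical candidate predictors. Each estimates a point from its four
# spatial neighbors (two on each side) in the current time step.
def neighbor_mean(l2, l1, r1, r2):
    return (l1 + r1) / 2.0

def neighbor_median(l2, l1, r1, r2):
    return median([l2, l1, r1, r2])

def cubic_interpolation(l2, l1, r1, r2):
    # 4-point Lagrange interpolation evaluated at the center point.
    return (-l2 + 4.0 * l1 + 4.0 * r1 - r2) / 6.0

CANDIDATES = {
    "neighbor_mean": neighbor_mean,
    "neighbor_median": neighbor_median,
    "cubic_interpolation": cubic_interpolation,
}

def residuals(values, predictor):
    """Absolute prediction error for every interior point."""
    return {
        i: abs(values[i] - predictor(values[i - 2], values[i - 1],
                                     values[i + 1], values[i + 2]))
        for i in range(2, len(values) - 2)
    }

def select_best_fit(values):
    """Pick the candidate with the smallest median residual on this snapshot."""
    scores = {name: median(residuals(values, f).values())
              for name, f in CANDIDATES.items()}
    return min(scores, key=scores.get)

def detect(values, tolerance):
    """Return the selected predictor name and the set of flagged indices."""
    name = select_best_fit(values)
    res = residuals(values, CANDIDATES[name])
    return name, {i for i, r in res.items() if r > tolerance}

def recall_and_fpr(flagged, corrupted, checked):
    """Recall (sensitivity) and false positive rate over the checked points."""
    true_pos = len(flagged & corrupted)
    false_pos = len(flagged - corrupted)
    negatives = len(checked - corrupted)
    recall = true_pos / len(corrupted) if corrupted else 1.0
    fpr = false_pos / negatives if negatives else 0.0
    return recall, fpr

# Synthetic smooth snapshot with one injected corruption (illustrative data).
snapshot = [math.sin(0.1 * i) for i in range(200)]
snapshot[100] += 0.5
name, flagged = detect(snapshot, tolerance=0.1)
checked = set(range(2, len(snapshot) - 2))
recall, fpr = recall_and_fpr(flagged, {100}, checked)
print(name, sorted(flagged), recall, fpr)
```

In this sketch, points adjacent to a corrupted value may also be flagged, because the corrupted value enters their neighbor features. A real detector would need a strategy for that, as well as a principled way to set the tolerance.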
Revised: June 14, 2019
Published: September 5, 2017
Citation
Subasi O., S. Di, P. Balaprakash, O. Unsal, J. Labarta, A. Cristal, and S. Krishnamoorthy, et al. 2017. MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection. In IEEE International Conference on Cluster Computing (CLUSTER 2017), September 5-8, 2017, Honolulu, HI, 717-724. Los Alamitos, California: IEEE Computer Society. PNNL-SA-128115. doi:10.1109/CLUSTER.2017.128