Rising soft error rates in memory systems are an emerging concern for modern computing systems. As a result, detectable but uncorrectable errors (DUEs) are likely to become more frequent and to affect HPC applications. Today, an HPC application that encounters a DUE crashes, incurring significant performance, storage, and energy overheads. In this paper, we propose a technique that continues application execution past a DUE by repairing the corrupted memory data, leveraging spatial data smoothness. We present BonVoision, a run-time system that intercepts DUE events, analyzes the binary to identify data elements in the structural neighborhood of the event, and fixes the corrupted data elements by interpolating from the values in their neighborhood. Our evaluation demonstrates that BonVoision incurs negligible overhead and outperforms other recovery strategies by a factor of 2, on average. We also demonstrate that BonVoision improves the efficiency of existing checkpoint/restart schemes by increasing the optimal checkpoint interval by approximately 23%.
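The idea of repairing a corrupted element from its neighborhood can be illustrated with a minimal sketch. This is not BonVoision's implementation: the function name `repair_element`, the 2D list layout, and the choice of a plain 4-neighbor average are all assumptions made for illustration, and the sketch skips the DUE interception and binary analysis steps entirely.

```python
def repair_element(grid, row, col):
    """Overwrite grid[row][col] with the mean of its in-bounds 4-neighbors.

    Hypothetical helper: assumes the data is spatially smooth, so the
    average of adjacent values is a reasonable stand-in for the lost one.
    """
    neighbors = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        # Skip positions that fall outside the grid (edges and corners).
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            neighbors.append(grid[r][c])
    if not neighbors:
        raise ValueError("no neighbors available to interpolate from")
    grid[row][col] = sum(neighbors) / len(neighbors)
    return grid[row][col]


# A smooth field whose center value has been corrupted.
field = [
    [1.0, 2.0, 3.0],
    [2.0, 1e30, 4.0],
    [3.0, 4.0, 5.0],
]
repair_element(field, 1, 1)  # center becomes the mean of 2.0, 4.0, 2.0, 4.0
```

The repaired value is only an approximation of the original; whether that is acceptable depends on how smooth the data is and how tolerant the application is of small perturbations.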
Revised: November 26, 2019
Published: June 26, 2019
Citation
Fang B., H. Halawa, K. Pattabiraman, M. Ripeanu, and S. Krishnamoorthy. 2019. BonVoision: Leveraging Spatial Data Smoothness for Recovery from Memory Soft Errors. In Proceedings of the ACM International Conference on Supercomputing (ICS 2019), June 26-28, 2019, Phoenix, AZ, 484-496. New York, New York: ACM. PNNL-SA-143140. doi:10.1145/3330345.3330388