December 17, 2018
Conference Paper

Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability

Abstract

Checkpointing is the most widely used technique in high-performance computing (HPC) for ensuring application progress in the presence of failures. In this paper, we present mathematical models of checkpointing systems to quantify their reliability and availability. We perform a trade-off analysis with respect to resource costs and reliability. We then explore optimal checkpoint placement to maximize system availability. Finally, we rigorously compare the behavior of redundant systems that employ replication and repair mechanisms. We postulate that the proposed models can aid system designers, who can instantiate them to assess and quantify the availability and reliability of systems of interest. Our study demonstrates that which configuration yields the most reliable and available system depends on the parameter settings.

Revised: May 22, 2019 | Published: December 17, 2018

Citation

Subasi O., R. Tipireddy, and S. Krishnamoorthy. 2018. Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability. In IEEE 25th International Conference on High Performance Computing (HiPC 2018), December 17-20, 2018, Bengaluru, India, 183-192. Los Alamitos, California: IEEE Computer Society. PNNL-SA-138205. doi:10.1109/HiPC.2018.00029