Checkpointing is the most widely used technique in
high-performance computing (HPC) to ensure application
progress in the presence of failures. In this paper, we present
mathematical models of checkpointing systems to quantify
their reliability and availability. We perform a trade-off analysis
with respect to resource costs and reliability. Then, we explore
the optimal checkpoint placement for checkpointing systems to
maximize system availability. Finally, we rigorously compare
the behavior of redundant systems that employ replication and
repair mechanisms. We postulate
that the proposed models can aid system designers, who can
instantiate our models to assess and quantify the availability and
reliability of systems of interest. Our study demonstrates that
which configuration yields the most reliable and available system
depends on the parameter settings.
Revised: May 22, 2019 | Published: December 17, 2018
Citation
Subasi O., R. Tipireddy, and S. Krishnamoorthy. 2018. Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability. In IEEE 25th International Conference on High Performance Computing (HiPC 2018), December 17-20, 2018, Bengaluru, India, 183-192. Los Alamitos, California: IEEE Computer Society. PNNL-SA-138205. doi:10.1109/HiPC.2018.00029