December 20, 2024
Conference Paper

Benchmarking Variables for Checkpointing in HPC Applications

Abstract

Checkpoint/Restart (C/R) is a widely used fault-tolerance mechanism in converged cloud, edge, and HPC systems. However, users often rely on their own experience to decide which variables to checkpoint, because no benchmark currently provides a reference. This can result in checkpointing redundant or even incorrect variables. To address this issue, we propose a benchmark suite of 20 representative HPC applications that includes manually identified critical variables for checkpointing, together with a method for identifying those variables. Our method analyzes data dependencies between variables to identify the critical variables analytically. We verify the correctness of the identified variables through an ablation study using FTI, a widely used C/R library. With our benchmark suite and data dependency analysis, HPC practitioners now have a reference for identifying checkpointing variables and a better understanding of which kinds of variables to checkpoint.
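
To make the idea of dependency-based identification concrete, the following is a minimal Python sketch, not the method from the paper. It assumes a simplified model in which the main loop body is a list of assignments, each recording the variable written and the variables read, and it treats a variable as critical when it is read before being overwritten in an iteration and is also modified by the loop. The function name `critical_variables` and the toy loop body are hypothetical illustrations.

```python
from typing import List, Set, Tuple

# Assumed toy representation: each statement of the main loop body is
# (variable_written, set_of_variables_read). This is an illustrative model,
# not the representation used in the paper.
Statement = Tuple[str, Set[str]]


def critical_variables(loop_body: List[Statement]) -> Set[str]:
    """Return variables whose values must survive across iterations.

    A variable is treated as critical here if it is read before being
    written within one iteration (so the iteration depends on a value
    from earlier) and it is also written somewhere in the loop (so that
    value cannot simply be re-initialized after a restart).
    """
    written: Set[str] = set()
    read_before_write: Set[str] = set()
    for target, reads in loop_body:
        # Reads of variables not yet written in this iteration depend on
        # values produced before the iteration started.
        read_before_write |= reads - written
        written.add(target)
    return read_before_write & written


# Hypothetical time-stepping loop:
#   tmp  = f(u, dt)
#   u    = g(u, tmp)
#   step = step + 1
body: List[Statement] = [
    ("tmp", {"u", "dt"}),
    ("u", {"u", "tmp"}),
    ("step", {"step"}),
]

print(sorted(critical_variables(body)))
```

In this toy model, `tmp` is excluded because it is recomputed in every iteration before use, and `dt` is excluded because the loop never modifies it. Real applications involve arrays, pointers, and inter-procedural dependencies that this sketch does not handle.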

Published: December 20, 2024

Citation

Fu X., X. Huang, W. Xu, W. Zhang, S. Meng, L. Guo, and K. Sato. 2024. Benchmarking Variables for Checkpointing in HPC Applications. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2024), May 27-31, 2024, San Francisco, CA, 406-413. Piscataway, New Jersey: IEEE. PNNL-SA-204757. doi:10.1109/IPDPSW63119.2024.00090
