March 5, 2025
Conference Paper

An Efficient Checkpointing System for Large Machine Learning Model Training

Abstract

As machine learning models rapidly grow in size and complexity, checkpointing during training has become a bottleneck in both storage consumption and performance (time). For example, the latest GPT-4 model is reported to have on the order of 1.76 trillion parameters; frequently writing checkpoints containing more than one trillion floating-point values to storage is extremely time- and storage-intensive. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interface of a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that removes outdated checkpoints to reduce the storage burden; and ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
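Taken together, the two optimizations amount to writing each checkpoint to fast node-local storage, staging it to the shared file system in the background, and periodically deleting outdated checkpoints on both tiers. The following minimal Python sketch illustrates that general idea only; it is not the paper's implementation, and the directory paths, retention count, and placeholder byte-string save format are assumptions made purely for illustration (a real training job would serialize model state with torch.save or a framework-specific API).

```python
# Minimal sketch of periodic checkpoint cleaning plus local-to-shared staging.
# Paths, retention count, and the byte-string "checkpoint" are illustrative assumptions.
import shutil
import threading
from pathlib import Path

LOCAL_CKPT_DIR = Path("/local/scratch/ckpts")   # fast node-local storage (assumed path)
SHARED_CKPT_DIR = Path("/shared/pfs/ckpts")     # shared/parallel file system (assumed path)
KEEP_LAST = 2                                   # number of recent checkpoints to retain


def save_checkpoint(state: bytes, step: int) -> Path:
    """Write a checkpoint to fast local storage first (placeholder for torch.save)."""
    LOCAL_CKPT_DIR.mkdir(parents=True, exist_ok=True)
    path = LOCAL_CKPT_DIR / f"ckpt_step{step:08d}.bin"
    path.write_bytes(state)
    return path


def stage_to_shared(local_path: Path) -> None:
    """Copy a checkpoint from local storage to the shared file system."""
    SHARED_CKPT_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copy2(local_path, SHARED_CKPT_DIR / local_path.name)


def clean_old_checkpoints(ckpt_dir: Path, keep_last: int = KEEP_LAST) -> None:
    """Periodic cleaning: delete all but the newest `keep_last` checkpoints."""
    ckpts = sorted(ckpt_dir.glob("ckpt_step*.bin"))
    for stale in ckpts[:-keep_last]:
        stale.unlink()


if __name__ == "__main__":
    pending = []                                 # staging threads not yet completed
    for step in range(0, 5000, 1000):
        model_state = b"\x00" * 1024             # stand-in for serialized model parameters
        local_path = save_checkpoint(model_state, step)
        # Stage asynchronously so training does not block on the slower shared file system.
        t = threading.Thread(target=stage_to_shared, args=(local_path,))
        t.start()
        pending.append(t)
        # Periodic cleaning: wait for in-flight copies, then drop outdated checkpoints
        # on both the local and the shared tier.
        if len(pending) > KEEP_LAST:
            for t in pending:
                t.join()
            pending.clear()
            clean_old_checkpoints(LOCAL_CKPT_DIR)
            clean_old_checkpoints(SHARED_CKPT_DIR)
```

In this sketch the local write is on the critical path while the copy to the shared file system happens in a background thread, and cleanup is deferred until in-flight copies finish so a checkpoint is never deleted before it has been staged.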

Published: March 5, 2025

Citation

Xu, W., X. Huang, S. Meng, W. Zhang, L. Guo, and K. Sato. 2024. An Efficient Checkpointing System for Large Machine Learning Model Training. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 17-22, 2024, Atlanta, GA, 896-900. Piscataway, New Jersey: IEEE. PNNL-SA-204755. doi:10.1109/SCW63240.2024.00127