Recent trends in high-performance computing point towards increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean-time-between-failures (MTBF) ranging from a few days to a few hours. As such, for long-running applications, the ability to efficiently recover from frequent failures is essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. We present results from a computational chemistry application running at scale to show that our techniques provide applications with a high degree of fault tolerance and low (2% to 4%) overhead for 2048 processors.
Revised: March 28, 2011
Published: February 9, 2011
Citation
Ali N., S. Krishnamoorthy, N. Govind, and B.J. Palmer. 2011. A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models. In Proceedings of the 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2011), February 9-11, 2011, Ayia Napa, Cyprus, 24-31. Los Alamitos, California: IEEE Computer Society. PNNL-SA-75835.