May 5, 2011
Conference Paper

Tolerating Correlated Failures for Generalized Cartesian Distributions via Bipartite Matching

Abstract

Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. A key ingredient of any approach to fault tolerance is effective support for fault-tolerant data storage. A typical application execution consists of phases in which certain data structures are modified while others are read-only. Often, read-only data structures constitute a large fraction of the total memory consumed. Fault tolerance for read-only data can be ensured through the use of checksums or parities, without resorting to expensive in-memory duplication or checkpointing to secondary storage. In this paper, we present a graph-matching approach to compute and store parity data for read-only matrices that are compatible with fault-tolerant linear algebra (FTLA). Typical approaches only support blocked data distributions in which each process holds one block and the parity is located on additional processes. The matrices are assumed to be blocked by a Cartesian grid with each block assigned to a process. We consider a generalized distribution in which each process can be assigned arbitrary blocks. We also account for the fact that multiple processes might be part of the same failure unit, say an SMP node. The flexibility enabled by our novel application of graph matching extends fault tolerance support to data distributions beyond those supported by prior work. We evaluate the matching implementations and the cost of computing the parity and recovering lost data, demonstrating the low overhead incurred by our approach.
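To illustrate the two ideas the abstract combines, the following is a minimal sketch and not the paper's algorithm. It assumes XOR parity over equally sized byte blocks, and it assumes that each parity block should live on a distinct failure unit that holds none of the data blocks it protects. The names `xor_parity`, `recover`, `place_parity`, `group_nodes`, and `all_nodes` are hypothetical and chosen only for this example. The placement step uses a plain augmenting-path bipartite matching, which may differ from the matching implementations evaluated in the paper.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the bytewise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing block of a group from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


def place_parity(group_nodes, all_nodes):
    """Assign each parity group a distinct node that holds none of its blocks.

    group_nodes maps a group name to the set of nodes (failure units) holding
    its data blocks. This is an assumed model, solved here with a simple
    augmenting-path bipartite matching between groups and nodes.
    """
    owner = {}  # node -> group whose parity is currently placed on it

    def try_assign(group, seen):
        for node in all_nodes:
            if node in group_nodes[group] or node in seen:
                continue  # node holds data of this group, or was already tried
            seen.add(node)
            # Take a free node, or move the current occupant elsewhere.
            if node not in owner or try_assign(owner[node], seen):
                owner[node] = group
                return True
        return False

    for group in group_nodes:
        if not try_assign(group, set()):
            raise ValueError(f"no safe parity location for group {group}")
    return {group: node for node, group in owner.items()}


# Usage: three blocks protected by one parity block, then one block is lost.
blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(blocks)
restored = recover([blocks[0], blocks[2]], parity)

# Usage: three groups spread over three nodes, each needing a safe parity node.
group_nodes = {"g0": {"n0", "n1"}, "g1": {"n1", "n2"}, "g2": {"n0", "n2"}}
placement = place_parity(group_nodes, ["n0", "n1", "n2"])
```

In this sketch, a node failure destroys at most one member (data or parity) of each group, so every affected group can be rebuilt from its survivors. Treating an SMP node rather than a process as the unit in `all_nodes` is how the sketch models correlated failures.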

Revised: December 2, 2011 | Published: May 5, 2011

Citation

Ali N., S. Krishnamoorthy, M. Halappanavar, and J.A. Daily. 2011. Tolerating Correlated Failures for Generalized Cartesian Distributions via Bipartite Matching. In Proceedings of the 8th ACM International Conference on Computing Frontiers (CF 2011), May 3-5, 2011, Ischia, Italy. New York, New York: Association for Computing Machinery. PNNL-SA-76095. doi:10.1145/2016604.2016649