Multi-fault Tolerance for Cartesian Data Distributions
Ali N, S Krishnamoorthy, M Halappanavar, and JA Daily. 2013. "Multi-fault Tolerance for Cartesian Data Distributions." International Journal of Parallel Programming 41(3):469-493. doi:10.1007/s10766-012-0218-5
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
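To make the idea of "parities along the dimensions of a matrix" concrete, the following is a minimal sketch, not the paper's algorithm. It assumes each data block can be stood in for by a single integer, uses plain sums as the parity, and keeps parities in separate lists rather than colocating them with data blocks. The function names `add_parities` and `recover` are hypothetical. The sketch shows only why storing parities along both rows and columns allows recovery from several simultaneous block losses; the graph-matching placement of parity blocks and the handling of correlated failures described above are not modeled.

```python
def add_parities(grid):
    """Return (row_parity, col_parity) for a 2-D grid of numeric blocks.

    Assumption: a plain sum stands in for the parity of each row/column.
    """
    row_parity = [sum(row) for row in grid]
    col_parity = [sum(col) for col in zip(*grid)]
    return row_parity, col_parity


def recover(grid, row_parity, col_parity):
    """Rebuild lost blocks (marked None) in place.

    Repeatedly solves any row or column with exactly one missing block.
    Returns True if every block was recovered, False otherwise.
    """
    rows, cols = len(grid), len(grid[0])
    progress = True
    while progress:
        progress = False
        # Rows with a single lost block can be solved from the row parity.
        for i in range(rows):
            missing = [j for j in range(cols) if grid[i][j] is None]
            if len(missing) == 1:
                known = sum(v for v in grid[i] if v is not None)
                grid[i][missing[0]] = row_parity[i] - known
                progress = True
        # Columns with a single lost block can be solved from the column parity.
        for j in range(cols):
            missing = [i for i in range(rows) if grid[i][j] is None]
            if len(missing) == 1:
                known = sum(grid[i][j] for i in range(rows)
                            if grid[i][j] is not None)
                grid[missing[0]][j] = col_parity[j] - known
                progress = True
    return all(v is not None for row in grid for v in row)


# Usage: three simultaneous losses, two of them in the same row.
original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_parity, col_parity = add_parities(original)
damaged = [[None, None, 3], [4, None, 6], [7, 8, 9]]
recovered = recover(damaged, row_parity, col_parity)
```

In this toy setting, a row that loses two blocks cannot be repaired from its row parity alone, but the column parities resolve it. Four losses arranged as a rectangle defeat the scheme, which illustrates why the placement of blocks that may fail together matters.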