September 3, 2007
Conference Paper

Towards Fault Resilient Global Arrays

Abstract

This paper focuses on adding fault resiliency to Global Arrays (GA). We extended the GA toolkit to provide a minimal set of capabilities that enable programmers to implement fault resiliency at the user level. Our fault-recovery approach is programmer-assisted and based on frequent incremental checkpoints and rollback recovery. In addition, it relies on a pool of spare nodes that are used to replace a failing node. We demonstrate the usefulness of fault-resilient Global Arrays in an application context.

Revised: July 12, 2010 | Published: September 3, 2007

Citation

Tipparaju V., M. Krishnan, B.J. Palmer, F. Petrini, and J. Nieplocha. 2007. Towards Fault Resilient Global Arrays. In Parallel Computing: Architectures, Algorithms and Applications: NIC Series, 38, 339-345. Julich: John von Neumann Institute for Computing. PNNL-SA-54426.