June 18, 2007
Conference Paper

Transparent System-level Migration of PGAS Applications using Xen on Infiniband

Abstract

Checkpoint-Restart is considered one of the most natural approaches to achieving fault tolerance in a high-performance cluster. While early research focused on user-level solutions, the advent of efficient system-level virtualization software, such as Xen and VMware, has opened the door to efficient and scalable cluster-level virtualization. In this paper we present an innovative approach to cluster fault tolerance that integrates Xen virtualization with the latest generation of the Infiniband network. A major contribution of this paper is the automatic identification of global recovery lines to freeze the status of the machine. Our focus is on the partitioned global address space (PGAS) programming model. PGAS models have been receiving an increasing amount of attention in recent years. We have developed global coordination mechanisms and deployed them in the ARMCI one-sided communication library, which has been used as a run-time system for several PGAS models. The experimental results show that it is possible to virtualize the communication and the computation with minimal overhead and to provide seamless migration capabilities.

Revised: January 17, 2011 | Published: June 18, 2007

Citation

Scarpazza D.P., P. Mullaney, O. Villa, F. Petrini, V. Tipparaju, D.M. Brown, and J. Nieplocha. 2007. Transparent System-level Migration of PGAS Applications using Xen on Infiniband. In 2007 IEEE International Conference on Cluster Computing, 74-83. Piscataway, New Jersey: IEEE. PNNL-SA-55723. doi:10.1109/CLUSTR.2007.4629219