Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.
Revised: August 31, 2017 |
Published: July 3, 2017
Citation
Kestor G., S. Krishnamoorthy, and W. Ma. 2017.Localized Fault Recovery for Nested Fork-Join Programs. In Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017), May 29-June 2, 2017, Orlando, Florida, 397-408. Los Alamitos, California:IEEE Computer Society.PNNL-SA-123481.doi:10.1109/IPDPS.2017.75