Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. While effective at small scale, centralized load balancing schemes quickly become a bottleneck on large-scale clusters. Work stealing is a popular approach to distributed dynamic load balancing; however its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
Revised: August 25, 2010 |
Published: November 14, 2009
Citation
Dinan J.S., D.B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. 2009.Scalable Work Stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Article No. 53. New York, New York:Association for Computing Machinery.PNNL-SA-67261.doi:10.1145/1654059.1654113