September 2, 2019
Conference Paper

Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training

Abstract

In this paper, we extend an existing runtime system (the TensorFlow runtime) to enable automatic concurrency control and scheduling of operations. We explore performance modeling to predict the performance of operations under various degrees of thread-level parallelism. Our performance model is highly accurate and lightweight. Leveraging the performance model, our runtime system employs a set of scheduling strategies that co-run operations to improve hardware utilization and system throughput. Our runtime system demonstrates a significant performance benefit. Compared with the recommended configurations for concurrency control and operation scheduling in TensorFlow, our approach improves performance (execution time) by 36% on average (up to 49%) across four neural network models, and achieves performance close to the optimum obtained manually by the user.
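For context, the concurrency-control knobs referred to in the abstract are TensorFlow's intra-op and inter-op thread-pool sizes, which users normally set by hand. The sketch below shows how these are configured manually with the TensorFlow 1.x API that was current when the paper was published; the thread counts are illustrative placeholders, not values or code from the paper, whose contribution is choosing such settings automatically at runtime.

```python
import tensorflow as tf  # TensorFlow 1.x API, contemporary with the paper

# Manually chosen thread-pool sizes (illustrative values only).
# intra_op: threads used to parallelize a single operation (e.g., a matmul).
# inter_op: threads used to run independent operations concurrently.
config = tf.ConfigProto(
    intra_op_parallelism_threads=8,
    inter_op_parallelism_threads=2,
)

with tf.Session(config=config) as sess:
    # Build and run the training graph under this fixed concurrency setting.
    ...
```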

Revised: February 10, 2021 | Published: September 2, 2019

Citation

Liu J., D. Li, G. Kestor, and J.S. Vetter. 2019. Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), May 20-24, 2019, Rio de Janeiro, Brazil, 188-199. Los Alamitos, California: IEEE Computer Society. PNNL-SA-141283. doi:10.1109/IPDPS.2019.00029