Tensor contractions are generalized multidimensional matrix multiplications that occur widely in quantum chemistry. Executing tensor contractions efficiently on GPUs requires addressing several challenges, including index permutation and small dimension sizes that reduce thread block utilization. In this paper, we present our approach to automatically generating CUDA code that executes tensor contractions on GPUs, including management of data movement between CPU and GPU. GPU-enabled code is generated for the most expensive contractions in CCSD(T) and incorporated into NWChem, a popular computational chemistry suite. We demonstrate a speedup of over a factor of 8.4 using one core per node, and of over 2.6 when utilizing the entire system with a hybrid CPU+GPU solution with 2 GPUs and 5 cores. Finally, we analyze the behavior of the application on future GPU systems.
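To make the phrase "generalized multidimensional matrix multiplication" concrete, the following is a minimal CPU-side sketch in plain Python. It is not the generated CUDA code described in the paper. The function names, the chosen index pattern C[a][b][c] = sum over k of A[a][k][c] * B[k][b], and the nested-list layout are all assumptions made purely for illustration. The second function shows why index permutation matters: the same contraction becomes an ordinary matrix product once the indices of A are reordered.

```python
# Illustrative sketch only (hypothetical names, not from the paper).
# Contraction: C[a][b][c] = sum_k A[a][k][c] * B[k][b]

def contract_naive(A, B):
    """Direct loop nest over all output indices, summing over k."""
    na, nk, nc = len(A), len(A[0]), len(A[0][0])
    nb = len(B[0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for a in range(na):
        for b in range(nb):
            for c in range(nc):
                C[a][b][c] = sum(A[a][k][c] * B[k][b] for k in range(nk))
    return C

def contract_via_matmul(A, B):
    """Same contraction expressed as permute, matrix multiply, permute back."""
    na, nk, nc = len(A), len(A[0]), len(A[0][0])
    nb = len(B[0])
    # Permute A[a][k][c] into a matrix with rows (a, c) and columns k.
    A2 = [[A[a][k][c] for k in range(nk)]
          for a in range(na) for c in range(nc)]
    # Ordinary matrix product: (na*nc x nk) times (nk x nb).
    M = [[sum(row[k] * B[k][b] for k in range(nk)) for b in range(nb)]
         for row in A2]
    # Scatter rows (a, c) and columns b back into C[a][b][c].
    return [[[M[a * nc + c][b] for c in range(nc)]
             for b in range(nb)] for a in range(na)]
```

Both functions return the same result for any conforming inputs. When an extent such as the length of the c index is very small, the reshaped matrices have few rows per block of work, which hints at why small dimension sizes can leave GPU thread blocks underutilized.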
Revised: November 10, 2010
Published: September 20, 2010
Citation
Ma W., S. Krishnamoorthy, O. Villa, and K. Kowalski. 2010. Acceleration of Streamed Tensor Contraction Expressions on GPGPU-based Clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2010), 207-216. Piscataway, New Jersey: Institute of Electrical and Electronics Engineers. PNNL-SA-73012. doi:10.1109/CLUSTER.2010.26