March 1, 2013
Journal Article

Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution

Abstract

Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension sizes that reduce thread-block utilization. Moreover, applying the same optimizations to a variety of expressions requires a code generation tool. In this paper, we present our approach to automatically generating CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled-cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate a speedup of over a factor of 8.4 using one GPU (instead of one core per node) and of over 2.6 when utilizing the entire system with a hybrid CPU+GPU solution using 2 GPUs and 5 cores (instead of 7 cores per node). Finally, we analyze the behavior of the implementation on future GPU systems.
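The core idea summarized above, treating a multidimensional contraction as an index permutation followed by a matrix multiplication, can be illustrated with a minimal CUDA kernel. This is an illustrative sketch only, not the paper's generated NWChem code: the specific contraction, the index names (h3, h4, p5, ...), and the row-major flattening are assumptions chosen to show how the contracted indices collapse into the inner dimension of an ordinary matrix product.

    // Sketch (assumed example): the contraction
    //   C(h3,h4,p1,p2) += sum_{p5,p6} A(h3,h4,p5,p6) * B(p5,p6,p1,p2)
    // is flattened into a matrix product C[M x N] += A[M x K] * B[K x N] with
    //   M = |h3|*|h4|, K = |p5|*|p6|, N = |p1|*|p2|.
    #include <cuda_runtime.h>

    __global__ void contract_flat(const double *A, const double *B, double *C,
                                  int M, int N, int K)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // combined (h3,h4) index
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // combined (p1,p2) index
        if (row < M && col < N) {
            double sum = 0.0;
            for (int k = 0; k < K; ++k)                   // contract over (p5,p6)
                sum += A[row * K + k] * B[k * N + col];
            C[row * N + col] += sum;
        }
    }

    // Example launch with 16x16 thread blocks on device arrays dA, dB, dC:
    //   dim3 block(16, 16);
    //   dim3 grid((N + 15) / 16, (M + 15) / 16);
    //   contract_flat<<<grid, block>>>(dA, dB, dC, M, N, K);

When the input tensors do not arrive with the contracted indices laid out contiguously, an index-permutation step must first reorder the data into this flattened layout, which is one of the challenges the paper addresses.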

Revised: April 19, 2013 | Published: March 1, 2013

Citation

Ma, W., S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal. 2013. "Optimizing Tensor Contraction Expressions for Hybrid CPU-GPU Execution." Cluster Computing 16, no. 1: 131-155. PNNL-SA-79187. doi:10.1007/s10586-011-0179-2