Many-core Graphics Processing Units (GPUs) have been used as the computation engine in many scientific fields because of their high peak performance, cost effectiveness, and the availability of user-friendly programming environments such as NVIDIA CUDA. However, the conventional data-parallel GPU programming paradigm cannot satisfactorily address load balancing and GPU resource utilization when applications exhibit irregular and unbalanced workload patterns. In this paper, we explore the design space of task-based solutions for multi-GPU systems. By employing finer-grained tasks than current CUDA supports, and by allowing task sharing, our solutions enable dynamic load balancing. We evaluate our solutions with a Molecular Dynamics application using different atom distributions, ranging from uniform to highly non-uniform. Experimental results obtained on a 4-GPU system show that, for non-uniformly distributed systems, our solutions achieve excellent speedup and significant performance improvement over other solutions based on the standard CUDA APIs.
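The load-balancing argument above can be illustrated with a small CPU-side simulation. This is a minimal sketch under stated assumptions, not the implementation described in the paper: it models four workers (standing in for GPUs or thread blocks), compares a fixed equal-count partition against a single shared task queue, and uses made-up task costs to mimic a non-uniform atom distribution. The function names and the workload are hypothetical.

```python
import heapq


def static_makespan(costs, workers):
    """Finish time when tasks are split into equal-count contiguous chunks."""
    chunk = (len(costs) + workers - 1) // workers
    loads = [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(workers)]
    return max(loads)


def shared_queue_makespan(costs, workers):
    """Finish time when each idle worker pulls the next task from one queue."""
    free_at = [0] * workers  # time at which each worker next becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # earliest idle worker takes the task
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical non-uniform workload: four expensive tasks (a dense region)
# followed by twelve cheap ones (sparse regions).
costs = [8] * 4 + [1] * 12
static_time = static_makespan(costs, 4)
dynamic_time = shared_queue_makespan(costs, 4)
print(static_time, dynamic_time)
```

In this toy workload the static partition gives all of the expensive tasks to one worker, while the shared queue spreads them across workers, so the overall finish time is bounded by the total work divided by the number of workers rather than by the heaviest chunk.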
Revised: November 30, 2011
Published: September 25, 2011
Citation
Chen L., O. Villa, and G.R. Gao. 2011. Exploring Fine-Grained Task-based Execution on Multi-GPU Systems. In IEEE International Conference on Cluster Computing (CLUSTER 2011), September 26-30, 2011, Austin, Texas. Los Alamitos, California: IEEE Computer Society. PNNL-SA-70335. doi:10.1109/CLUSTER.2011.50