In this paper, we proposed a novel clustering technique for tapping into the performance potential of a largely ignored type of locality: inter-CTA locality. We first demonstrated the capability of existing GPU hardware to exploit such locality, both spatially and temporally, on the L1 or L1/Tex unified cache. To verify the potential of this locality, we quantified its existence in a broad spectrum of applications and discussed its sources. Based on these insights, we proposed the concept of CTA-Clustering and its associated software techniques. Finally, we evaluated these techniques on all modern generations of NVIDIA GPU architectures. The experimental results showed that our proposed clustering techniques could significantly improve on-chip cache performance.
Revised: April 27, 2017
Published: April 8, 2017
Citation
Li A., S. Song, W. Liu, X. Liu, A. Kumar, and H. Corporaal. 2017. Locality-Aware CTA Clustering For Modern GPUs. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017), April 8–12, 2017, Xi'an, China, 297-311. New York, New York: ACM. PNNL-SA-123050. doi:10.1145/3037697.3037709