This paper describes the design and implementation of mechanisms for latency tolerance in remote memory access communication on clusters equipped with high-performance networks such as Myrinet. It discusses protocols and strategies that bridge the gap between user-level requirements and low-level, network-specific communication interfaces while increasing opportunities for latency hiding. Specifically, it explores mechanisms for overlapping communication with computation and for coalescing small messages (trading latency for bandwidth). The effectiveness of these techniques is evaluated with microbenchmarks and application kernels, including the NAS parallel benchmark suite. The microbenchmark results showed a better degree of overlap for nonblocking operations in ARMCI than in MPI. Application results showed improvements of up to 30-45% over MPI when nonblocking operations were used. Coalescing small messages through aggregation gave a performance improvement of up to 78% over non-aggregated communication.
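The latency-for-bandwidth trade behind message coalescing can be illustrated with a simple latency-bandwidth cost model. The sketch below is not taken from the paper: the function names, the cost model, and the sample latency and bandwidth values are all assumptions chosen for illustration, and the model ignores packing overhead and the time small messages wait before being sent.

```python
def separate_time(num_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Time to send each small message on its own.

    Assumed cost model: every message pays the fixed latency once,
    plus its size divided by the bandwidth.
    """
    per_message = latency_s + bytes_per_message / bandwidth_bytes_per_s
    return num_messages * per_message


def aggregated_time(num_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Time to send the same payload coalesced into a single message.

    The fixed latency is paid once for the whole batch.
    """
    total_bytes = num_messages * bytes_per_message
    return latency_s + total_bytes / bandwidth_bytes_per_s


def relative_improvement(separate, aggregated):
    """Fraction of the non-aggregated time saved by aggregation."""
    return (separate - aggregated) / separate


if __name__ == "__main__":
    # Hypothetical network parameters, not measurements from the paper.
    latency_s = 10e-6                 # 10 microseconds per message
    bandwidth_bytes_per_s = 200e6     # 200 MB/s

    t_sep = separate_time(100, 64, latency_s, bandwidth_bytes_per_s)
    t_agg = aggregated_time(100, 64, latency_s, bandwidth_bytes_per_s)
    print(f"separate:   {t_sep:.6e} s")
    print(f"aggregated: {t_agg:.6e} s")
    print(f"improvement: {relative_improvement(t_sep, t_agg):.1%}")
```

Under this model the benefit grows with the number of small messages, because the fixed per-message latency dominates when payloads are small; for a single message the two costs coincide.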
Revised: July 13, 2011
Published: December 1, 2003
Citation
Nieplocha J., V. Tipparaju, M. Krishnan, G. Santhanaraman, and D.K. Panda. 2003. Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters. In IEEE International Conference on Cluster Computing, 138-147. Los Alamitos, California: IEEE Computer Society. PNNL-SA-39466.