April 30, 2004
Conference Paper

SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems

Abstract

This paper describes a novel parallel algorithm that implements dense matrix multiplication with algorithmic efficiency equivalent to that of Cannon’s algorithm. It is suitable for clusters and shared memory systems. The approach differs from other parallel matrix multiplication algorithms in its explicit use of shared memory and remote memory access (RMA) communication rather than message passing. Experimental results on clusters (IBM SP, Linux-Myrinet) and shared memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over ScaLAPACK pdgemm, the leading implementation of parallel matrix multiplication in use today. In the best case, on the SGI Altix, the new algorithm performs 20 times better than ScaLAPACK for a matrix size of 1000 on 128 processors. The impact of zero-copy nonblocking RMA communication and shared memory communication on matrix multiplication performance on clusters is investigated.

Revised: September 5, 2007 | Published: April 30, 2004

Citation

Krishnan M., and J. Nieplocha. 2004. SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems. In 18th IEEE International Parallel and Distributed Processing Symposium, 70. Los Alamitos, California: IEEE Computer Society. PNNL-SA-45376. doi:10.1109/IPDPS.2004.1303000