December 3, 2024
Conference Paper

MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs

Abstract

Graph Neural Networks (GNNs) are indispensable for learning from graph-structured data, yet their rising computational cost, especially on massively connected graphs, poses significant challenges to execution performance. To tackle this, distributed-memory solutions such as partitioning the graph to concurrently train multiple replicas of GNNs are commonly used. However, approaches that require a partitioned graph usually suffer from communication overhead and load imbalance, even under optimal partitioning and communication strategies, due to irregularities in neighborhood minibatch sampling. This paper proposes practical trade-offs for reducing the sampling and communication overheads of representation learning on distributed graphs (using the popular GraphSAGE architecture) by developing a parameterized prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework, demonstrating about 15–40% improvement in end-to-end training performance on the NERSC Perlmutter supercomputer for various OGB datasets.
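To illustrate the general idea behind a parameterized prefetch-and-eviction scheme for remote ("halo") node features, the sketch below shows a minimal, self-contained Python cache with frequency-based eviction. All names (`PrefetchBuffer`, `fetch_remote`, `evict_fraction`) are hypothetical and do not reflect MassiveGNN's or DistDGL's actual APIs; this is a conceptual sketch, not the paper's implementation.

```python
# Hypothetical sketch: a prefetch buffer for remote node features in
# distributed minibatch GNN training. Misses are fetched in bulk (one
# communication round), and least-frequently-used entries are evicted
# once the buffer exceeds a parameterized capacity.
import numpy as np

class PrefetchBuffer:
    def __init__(self, capacity, evict_fraction=0.25):
        self.capacity = capacity              # max number of cached remote nodes
        self.evict_fraction = evict_fraction  # parameterized eviction aggressiveness
        self.features = {}                    # node id -> cached feature vector
        self.hits = {}                        # node id -> access count (eviction score)

    def get(self, node_ids, fetch_remote):
        """Return features for node_ids, fetching only the cache misses."""
        misses = [n for n in node_ids if n not in self.features]
        if misses:
            fetched = fetch_remote(misses)    # one bulk fetch for all misses
            for n, feat in zip(misses, fetched):
                self.features[n] = feat
                self.hits[n] = 0
        for n in node_ids:
            self.hits[n] += 1
        self._maybe_evict()
        return np.stack([self.features[n] for n in node_ids])

    def _maybe_evict(self):
        """Drop the least-frequently accessed entries once over capacity."""
        if len(self.features) <= self.capacity:
            return
        n_evict = int(len(self.features) * self.evict_fraction)
        coldest = sorted(self.hits, key=self.hits.get)[:n_evict]
        for n in coldest:
            del self.features[n], self.hits[n]

# Toy usage: remote fetches are simulated with random feature vectors.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat_dim = 8
    fetch_remote = lambda ids: [rng.standard_normal(feat_dim) for _ in ids]
    buf = PrefetchBuffer(capacity=64)
    for _ in range(10):                       # 10 simulated minibatches
        halo_ids = rng.integers(0, 200, size=32).tolist()
        feats = buf.get(halo_ids, fetch_remote)
        assert feats.shape == (32, feat_dim)
```

In a real distributed setting, the bulk fetch would correspond to a communication call to the partition owning those nodes, and the capacity and eviction fraction would be the tunable parameters that trade memory for reduced communication.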


Citation

Sarkar A., S. Ghosh, N.R. Tallent, and A. Jannesari. 2024. MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs. In IEEE International Conference on Cluster Computing (CLUSTER 2024), September 24-27, 2024, Kobe, Japan, 62-73. Piscataway, New Jersey: IEEE. PNNL-SA-200893. doi:10.1109/CLUSTER59578.2024.00013