May 2, 2025
Conference Paper
Distributed-Memory Sparse Deep Neural Network Inference Using Global Arrays
Abstract
Partitioned Global Address Space (PGAS) models hold tremendous promise for developing efficient and productive distributed-memory parallel applications. They have been used extensively in scientific computing because they offer a "shared-memory"-like model and convenient interfaces that decouple communication from synchronization. Traditionally, PGAS communication models have been applied to dense, contiguously distributed data, but most modern applications exhibit varying levels of sparsity. Existing PGAS models require certain adaptations to support distributed sparse computations, since the associated computations often require matrix arithmetic in addition to data movement. The Global Arrays toolkit from Pacific Northwest National Laboratory (PNNL) is one of the earliest PGAS models to combine one-sided data communication with distributed matrix operations and is still used in the popular NWChem quantum chemistry suite. Recently, we have expanded the Global Arrays toolkit to support common sparse operations, such as sparse matrix-dense matrix multiplication (SpMM), sparse matrix-sparse matrix multiplication (SpGEMM), and sampled dense-dense matrix multiplication (SDDMM). These operations are the bedrock of sparse Deep Learning (DL); sparse deep neural networks and Graph Neural Networks (GNNs) have recently gained increasing attention for achieving speedups in training and inference with reduced memory footprints. Unlike scientific applications in High Performance Computing (HPC), modern distributed-memory-capable DL toolkits often rely on non-standardized, closed-source vendor software optimizations, creating challenges for software-hardware co-design at scale. Our goal is to support a variety of distributed-memory sparse matrix operations and helper functions in the newly created Sparse Global Arrays (SGA), making it possible to build portable and productive Machine Learning workloads for algorithm/software and hardware co-design. Contemporary data-parallel schemes for training and inference are undergoing a major overhaul, since model replication limits scalability and wastes resources. We have therefore adopted tensor parallelism to decompose the model and inputs and mitigate memory pressure. The current implementation is built on top of MPI and targets CPUs to maximize portability across platforms.
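To make the three kernels named above concrete, the following is a minimal, single-node NumPy/SciPy sketch of SpMM, SpGEMM, and SDDMM. The shapes, densities, and SciPy calls are illustrative assumptions chosen for exposition only; they do not show the Sparse Global Arrays (SGA) interface.

    # Minimal illustration of the three sparse kernels (not the SGA API).
    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    m, k, n = 8, 6, 5                    # small example shapes (assumed)

    A = sp.random(m, k, density=0.3, format="csr", random_state=0)   # sparse operand
    B = rng.standard_normal((k, n))                                  # dense operand
    C = sp.random(k, n, density=0.3, format="csr", random_state=1)   # sparse operand
    S = sp.random(m, n, density=0.2, format="coo", random_state=2)   # sampling pattern
    D1 = rng.standard_normal((m, k))                                 # dense operand
    D2 = rng.standard_normal((k, n))                                 # dense operand

    # SpMM: sparse matrix times dense matrix -> dense result
    spmm = A @ B

    # SpGEMM: sparse matrix times sparse matrix -> sparse result
    spgemm = A @ C

    # SDDMM: evaluate the dense-dense product D1 @ D2 only at the nonzero
    # positions of S, scaling each sampled entry by S's value there
    vals = np.einsum("ij,ji->i", D1[S.row, :], D2[:, S.col])
    sddmm = sp.coo_matrix((S.data * vals, (S.row, S.col)), shape=S.shape)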
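The tensor-parallel, MPI-based CPU design mentioned above can likewise be sketched in a few lines. In this hypothetical mpi4py example, each rank owns a row block of a sparse layer's weight matrix, performs its local SpMM against replicated input activations, and allgathers the output rows; the 1D partitioning, variable names, and communication pattern are assumptions for illustration, not the paper's implementation.

    # Hypothetical 1D tensor-parallel sparse inference step with mpi4py.
    from mpi4py import MPI
    import numpy as np
    import scipy.sparse as sp

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_out, n_in, batch = 64, 32, 4       # layer/batch sizes (assumed)
    rows_per_rank = n_out // size        # assume size divides n_out evenly

    # Each rank stores only its row block of the sparse weight matrix.
    W_local = sp.random(rows_per_rank, n_in, density=0.1,
                        format="csr", random_state=rank)

    # Input activations are replicated: rank 0 generates, everyone receives.
    X = np.empty((n_in, batch))
    if rank == 0:
        X[:] = np.random.default_rng(0).standard_normal((n_in, batch))
    comm.Bcast(X, root=0)

    # Local SpMM yields this rank's slice of the output activations.
    Y_local = np.ascontiguousarray(W_local @ X)

    # Stitch the row slices together so every rank holds the full output.
    Y = np.empty((n_out, batch))
    comm.Allgather(Y_local, Y)

    if rank == 0:
        print("output shape:", Y.shape)

Such a script would be launched in the usual MPI fashion, e.g. mpiexec -n 4 python tp_spmm.py (the filename is assumed).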