July 7, 2023
Conference Paper

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

Abstract

Deep Learning Recommendation Models (DLRMs) are critical applications in various domains and have become one of the largest machine learning applications. With trillions of parameters, DLRMs exceed the on-chip memory capacity of GPUs, so distributed inference and training require large-scale multi-node systems. These systems suffer from an all-to-all communication bottleneck, which mainly limits the scalability of ever-growing DLRMs. In recent years, SmartNICs have evolved to couple computation and communication capabilities, offering the opportunity to act as a powerful heterogeneous device in the system. However, no existing distributed system fully leverages the abundant SmartNIC resources to resolve the scalability issue of DLRMs. In this work, we propose a software-hardware co-design of a heterogeneous SmartNIC system that resolves the communication bottleneck of distributed DLRMs, mitigates memory bandwidth pressure, and improves computation efficiency. We provide a set of SmartNIC designs, comprising cache systems (a local cache and a remote cache) and SmartNIC computation kernels, which reduce data movement, relieve memory lookup intensity, and improve the GPU's computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches, which increases data reuse and thereby improves overall system performance. Our evaluation shows that our system achieves a 2.1x latency speedup for inference and a 1.6x throughput speedup for training.
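
The abstract's point about batch-level data locality can be illustrated with a small sketch. This is not the paper's graph algorithm, which the abstract does not describe; it is a hypothetical greedy heuristic, with made-up names (`unique_lookups`, `chunk`, `greedy_locality_batches`) and toy data, that shows why grouping queries that share embedding IDs reduces the number of distinct lookups each batch must fetch.

```python
from typing import List, Set

Query = Set[int]  # assumption: a query is modeled as the set of embedding IDs it touches


def unique_lookups(batches: List[List[Query]]) -> int:
    """Count distinct embedding IDs fetched, counting each ID once per batch."""
    return sum(len(set().union(*batch)) for batch in batches)


def chunk(queries: List[Query], batch_size: int) -> List[List[Query]]:
    """Baseline: split queries into batches in arrival order."""
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]


def greedy_locality_batches(queries: List[Query], batch_size: int) -> List[List[Query]]:
    """Hypothetical heuristic: grow each batch with the query that overlaps it most."""
    remaining = list(queries)
    batches: List[List[Query]] = []
    while remaining:
        batch = [remaining.pop(0)]
        ids = set(batch[0])
        while remaining and len(batch) < batch_size:
            # Pick the remaining query sharing the most IDs with the current batch.
            best = max(range(len(remaining)), key=lambda i: len(ids & remaining[i]))
            query = remaining.pop(best)
            batch.append(query)
            ids |= query
        batches.append(batch)
    return batches


# Toy data (illustrative only): two pairs of queries with overlapping IDs.
queries = [{1, 2, 3}, {7, 8, 9}, {2, 3, 4}, {8, 9, 10}]
naive = chunk(queries, 2)
grouped = greedy_locality_batches(queries, 2)
print(unique_lookups(naive), unique_lookups(grouped))  # prints "12 8"
```

On this toy input, arrival-order batching fetches 12 distinct IDs in total, while the overlap-aware grouping fetches 8; fewer distinct IDs per batch means more reuse from a cache of fixed size.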

Published: July 7, 2023

Citation

Guo A., Y. Hao, C. Wu, P. Haghi, Z. Pan, M. Si, and D. Tao, et al. 2023. Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training. In Proceedings of the 27th International Conference on Supercomputing (ICS-2023), June 21-23, 2023, Orlando, FL, 336–347. New York, New York: Association for Computing Machinery. PNNL-SA-181666. doi:10.1145/3577193.3593724