September 17, 2024
Conference Paper

OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model

Abstract

With the sharply increasing volume of user data, the Deep Learning Recommendation Model (DLRM) has become indispensable infrastructure at large technology companies. However, large-scale DLRM on multi-GPU platforms remains inefficient due to unbalanced workload partitioning and intensive inter-GPU communication. To this end, we propose OPER, an OPtimality-guided Embedding table placement approach for large-scale Recommendation model training and inference. OPER explores the potential of mitigating remote-memory-access latency in DLRM through fine-grained embedding table placement. Specifically, OPER develops a theoretical model that relates embedding table (EMT) placement to embedding communication latency in both training and inference. OPER proves the NP-hardness of finding the optimal EMT placement and proposes a heuristic algorithm that yields near-optimal placements. OPER implements a SHMEM-based embedding table training system and a unified embedding index mapping to support fine-grained EMT sharding and placement. Comprehensive experiments show that OPER achieves average speedups of 3.4× and 5.1× in training and inference, respectively, over state-of-the-art DLRM frameworks.
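The placement problem the abstract describes can be pictured with a simple load-balancing sketch. The snippet below is a minimal, hypothetical illustration, not OPER's actual algorithm (which models inter-GPU communication latency and is detailed in the paper): it greedily assigns embedding tables to GPUs in decreasing order of an assumed per-table "hotness" score so that accumulated load stays balanced. All names and the hotness scoring model here are illustrative assumptions.

from typing import List

def place_tables(table_hotness: List[float], num_gpus: int) -> List[List[int]]:
    """Greedily assign embedding tables to GPUs so that accumulated
    per-GPU hotness (a stand-in for expected communication load)
    stays balanced; a classic longest-processing-time heuristic."""
    # Visit tables in decreasing hotness so the largest items are placed first.
    order = sorted(range(len(table_hotness)), key=lambda i: -table_hotness[i])
    loads = [0.0] * num_gpus                   # accumulated hotness per GPU
    placement = [[] for _ in range(num_gpus)]  # table ids assigned to each GPU
    for t in order:
        g = min(range(num_gpus), key=loads.__getitem__)  # least-loaded GPU
        loads[g] += table_hotness[t]
        placement[g].append(t)
    return placement

# Toy example: six tables with skewed hotness, split across two GPUs.
print(place_tables([9.0, 7.0, 4.0, 3.0, 2.0, 1.0], num_gpus=2))
# -> [[0, 3, 5], [1, 2, 4]]; both GPUs carry a load of 13.0

On this toy input the two GPUs end with equal loads, but a pure load-balancing heuristic like this ignores the communication cost between GPUs, which is precisely the term OPER's objective captures.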

Citation

Wang, Z., Y. Wang, B. Feng, G. Huang, D. Mudigere, B. Muthiah, A. Li, et al. 2024. OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model. In USENIX Annual Technical Conference, July 10-12, 2024, Santa Clara, CA, 667-682. Berkeley, California: USENIX, The Advanced Computing Systems Association. PNNL-SA-178386.