Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism

Gupta, Vipul; Choudhary, Dhruv; Tang, Ping Tak Peter; Wei, Xiaohan; Wang, Xing; Huang, Yuzhen; Kejariwal, Arun; Ramchandran, Kannan; Mahoney, Michael W.

(Submitted on 18 Oct 2020 (v1), last revised 21 May 2021 (this version, v2))

Abstract: In this paper, we consider hybrid parallelism -- a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP) -- to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication-efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication-efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as on recommendation systems used in production at Facebook. DCT reduces communication by at least 100× and 20× during DP and MP, respectively. The algorithm has been deployed in production, and it improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
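
The hard-thresholding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ThresholdedCompressor` and the parameters `keep_fraction` and `refresh_every` are hypothetical, and choosing the threshold as the magnitude of the k-th largest entry is an assumption, since the abstract does not say how the threshold is computed. The sketch only shows the two ingredients the abstract names: entries below a threshold are dropped before communication, and the threshold is recomputed only occasionally rather than at every iteration.

```python
import heapq


def top_k_threshold(values, keep_fraction):
    """Magnitude of the k-th largest entry (assumed rule for picking the threshold)."""
    if not values:
        return 0.0
    k = max(1, int(len(values) * keep_fraction))
    return heapq.nlargest(k, (abs(v) for v in values))[-1]


def hard_threshold(values, tau):
    """Keep only entries with |v| >= tau, as a sparse {index: value} mapping."""
    return {i: v for i, v in enumerate(values) if abs(v) >= tau}


class ThresholdedCompressor:
    """Hypothetical helper: reuses one threshold for `refresh_every` calls."""

    def __init__(self, keep_fraction=0.01, refresh_every=1000):
        self.keep_fraction = keep_fraction
        self.refresh_every = refresh_every
        self.tau = None
        self.step = 0

    def compress(self, values):
        # The (relatively costly) top-k selection runs only on refresh steps;
        # all other steps need a single pass comparing against the stored tau.
        if self.tau is None or self.step % self.refresh_every == 0:
            self.tau = top_k_threshold(values, self.keep_fraction)
        self.step += 1
        return hard_threshold(values, self.tau)


# Toy usage: keep roughly the top 25% of entries, refresh every 2 calls.
comp = ThresholdedCompressor(keep_fraction=0.25, refresh_every=2)
grad = [0.1, -2.0, 0.3, 0.05, 1.5, -0.2, 0.0, 0.4]
print(comp.compress(grad))  # {1: -2.0, 4: 1.5}
```

In this sketch the same routine could be applied to a gradient vector (the DP case) or to a per-sample activation vector (the MP case); how the paper actually handles each case, and any error-compensation it may use, is not stated in the abstract and is not modeled here.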

Comments:	27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021)
Subjects:	Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Cite as:	arXiv:2010.08899 [cs.LG]
	(or arXiv:2010.08899v2 [cs.LG] for this version)

Submission history

From: Vipul Gupta
[v1] Sun, 18 Oct 2020 01:44:42 GMT (1390kb,D)
[v2] Fri, 21 May 2021 08:23:19 GMT (5452kb,D)
