Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

Kim, Young Jin; Henry, Rawn; Fahim, Raffy; Awadalla, Hany Hassan

(Submitted on 18 Nov 2022)

Abstract: Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various natural language processing tasks, including machine translation. However, it remains challenging to deploy such models in real-life scenarios due to their large memory requirements and inefficient inference. In this work, we introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models and significantly cut memory consumption. We achieve up to a 26x speed-up in throughput and also reduce the model size to almost one eighth of the original 32-bit float model by quantizing expert weights into 4-bit integers. As a result, we are able to deploy 136x larger models with 27% less cost and significantly better quality compared to existing solutions. This enables a paradigm shift in deploying large-scale multilingual MoE transformer models, replacing the traditional practice of distilling teacher models into dozens of smaller models per language or task.
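
The "almost one eighth" figure follows from simple arithmetic: a 4-bit integer takes one eighth of the space of a 32-bit float, and the small remainder comes from metadata such as scaling factors (plus any weights left unquantized). The sketch below illustrates this with a generic symmetric per-column quantizer in NumPy. It is not the paper's implementation: the function names, the per-column scaling scheme, the [-7, 7] integer range, and the nibble-packing layout are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_int4(weights: np.ndarray):
    """Symmetric per-column quantization of a 2-D float32 matrix to 4-bit integers.

    Hypothetical helper for illustration; returns (packed bytes, per-column scales).
    """
    # One scale per column so the largest magnitude in a column maps to +/-7.
    max_abs = np.abs(weights).max(axis=0)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(weights / scales), -7, 7).astype(np.int8)

    # Keep the low 4 bits (two's complement) and pack two values per byte.
    nibbles = (q & 0x0F).astype(np.uint8).reshape(-1)
    if nibbles.size % 2:
        nibbles = np.append(nibbles, np.uint8(0))  # pad to an even count
    packed = nibbles[0::2] | (nibbles[1::2] << 4)
    return packed, scales


def dequantize_int4(packed: np.ndarray, scales: np.ndarray, shape):
    """Unpack the 4-bit values and rescale them back to float32."""
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2] = packed & 0x0F
    nibbles[1::2] = packed >> 4
    nibbles = nibbles[: shape[0] * shape[1]]  # drop padding, if any

    # Sign-extend: nibble values 8..15 represent -8..-1.
    q = nibbles.astype(np.int8)
    q = np.where(q > 7, q - 16, q)
    return q.reshape(shape).astype(np.float32) * scales


def compression_ratio(weights: np.ndarray, packed: np.ndarray, scales: np.ndarray) -> float:
    """Quantized size (packed weights plus scales) relative to the float32 size."""
    return (packed.nbytes + scales.nbytes) / weights.nbytes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((1024, 4096)).astype(np.float32)  # stand-in expert weight
    packed, scales = quantize_int4(w)
    w_hat = dequantize_int4(packed, scales, w.shape)
    print("size ratio vs. float32:", compression_ratio(w, packed, scales))
    print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

In this sketch the packed weights alone occupy exactly one eighth of the float32 storage, and the per-column scales push the total slightly above that, which is one plausible reading of "almost" in the abstract.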

Comments:	Accepted to SustaiNLP 2022 (EMNLP 2022)
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2211.10017 [cs.CL]
	(or arXiv:2211.10017v1 [cs.CL] for this version)

Submission history

From: Young Jin Kim
[v1] Fri, 18 Nov 2022 03:43:52 GMT (157kb,D)
