GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng

Full-text links:

Download:

Current browse context:

cs.CL

< prev | next >

new | recent | 2006

Computer Science > Computation and Language

Title: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Authors: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

(Submitted on 30 Jun 2020)

Abstract: Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2006.16668 [cs.CL]
	(or arXiv:2006.16668v1 [cs.CL] for this version)

Submission history

From: Orhan Firat [view email]
[v1] Tue, 30 Jun 2020 10:42:02 GMT (3323kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2006.16668

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computation and Language

Title: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Submission history