Learning Light-Weight Translation Models from Deep Transformer

Li, Bei; Wang, Ziyang; Liu, Hui; Du, Quan; Xiao, Tong; Zhang, Chunliang; Zhu, Jingbo

(Submitted on 27 Dec 2020)

Abstract: Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation based knowledge distillation approach that compresses the deep Transformer model into a shallow model. Experimental results on several benchmarks validate the effectiveness of our method. Our compressed model is 8× shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training, which achieves a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at this https URL

Comments:	Accepted by AAAI 2021
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2012.13866 [cs.CL]
	(or arXiv:2012.13866v1 [cs.CL] for this version)

Submission history

From: Li Bei
[v1] Sun, 27 Dec 2020 05:33:21 GMT (35kb,D)
