Lipschitz Constrained Parameter Initialization for Deep Transformers

Xu, Hongfei; Liu, Qiuhui; van Genabith, Josef; Xiong, Deyi; Zhang, Jingyi

(Submitted on 8 Nov 2019 (v1), last revised 5 May 2020 (this version, v2))

Abstract: The Transformer translation model employs residual connections and layer normalization to ease the optimization difficulties caused by its multi-layer encoder/decoder structure. Previous research shows that even with residual connections and layer normalization, deep Transformers remain difficult to train; in particular, Transformer models with more than 12 encoder/decoder layers fail to converge. In this paper, we first empirically demonstrate that a simple modification made in the official implementation, which changes the computation order of residual connection and layer normalization, can significantly ease the optimization of deep Transformers. We then compare the subtle differences in computation order in considerable detail, and present a parameter initialization method that leverages the Lipschitz constraint on the initialization of Transformer parameters to effectively ensure training convergence. In contrast to findings in previous research, we further demonstrate that with Lipschitz parameter initialization, deep Transformers with the original computation order can converge and obtain significant BLEU improvements with up to 24 layers. In contrast to previous research, which focuses on deep encoders, our approach additionally enables Transformers to benefit from deep decoders.
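
The abstract refers to two computation orders for residual connection and layer normalization without writing them out. The following is a minimal sketch of how these orders are commonly described: normalizing after the residual addition (the original order) versus normalizing the sublayer input before the residual addition (the modified order). It is an illustration under assumptions, not the authors' code: the function names, the toy sublayer, and the omission of learned gain and bias are all hypothetical simplifications, and the Lipschitz constrained initialization itself is not shown because the abstract gives no formula for it.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Learned gain and bias are omitted to keep the sketch short.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    """Original order: add the residual first, then normalize the sum."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_norm_block(x, sublayer):
    """Modified order: normalize the sublayer input, then add the residual."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def toy_sublayer(x):
    # Hypothetical stand-in for self-attention or a feed-forward network.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 4.0]
post_out = post_norm_block(x, toy_sublayer)  # output is re-normalized
pre_out = pre_norm_block(x, toy_sublayer)    # input x passes through unchanged
```

In the second variant the input reaches the output through an identity path that normalization never touches, which is one commonly cited reason that this order is easier to optimize at depth.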

Subjects:	Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:1911.03179 [cs.CL]
	(or arXiv:1911.03179v2 [cs.CL] for this version)

Submission history

From: Hongfei Xu
[v1] Fri, 8 Nov 2019 10:52:43 GMT (236kb,D)
[v2] Tue, 5 May 2020 13:08:37 GMT (77kb,D)
