LazyFormer: Self Attention with Lazy Update

Ying, Chengxuan; Ke, Guolin; He, Di; Liu, Tie-Yan

(Submitted on 25 Feb 2021)

Abstract: Improving the efficiency of Transformer-based language pre-training is an important task in NLP, especially for the self-attention module, which is computationally expensive. In this paper, we propose a simple but effective solution, called LazyFormer, which computes the self-attention distribution infrequently. LazyFormer is composed of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is computed only once, in the first layer, and is then reused in all upper layers of that block. In this way, the cost of computation can be greatly reduced. We also provide several training tricks for LazyFormer. Extensive experiments demonstrate the effectiveness of the proposed method.
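
The lazy-block idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, omits the feed-forward sublayers, layer normalization, and the training tricks, and assumes that each layer keeps its own value projection while reusing the attention distribution. The names `lazy_block`, `make_layer`, `Wq`, `Wk`, and `Wv` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_layer(rng, d, with_qk):
    # Hypothetical parameter container. Only the first layer of a block
    # needs query/key projections, since upper layers reuse its attention.
    layer = {"Wv": rng.normal(scale=d ** -0.5, size=(d, d))}
    if with_qk:
        layer["Wq"] = rng.normal(scale=d ** -0.5, size=(d, d))
        layer["Wk"] = rng.normal(scale=d ** -0.5, size=(d, d))
    return layer

def lazy_block(x, layers):
    # x: (seq_len, d) token representations entering the block.
    d = x.shape[-1]
    first = layers[0]
    # The attention distribution is computed once, from the block input.
    scores = (x @ first["Wq"]) @ (x @ first["Wk"]).T / np.sqrt(d)
    attn = softmax(scores)
    for layer in layers:
        # Assumption: every layer applies its own value projection,
        # reuses `attn`, and adds a residual connection.
        x = x + attn @ (x @ layer["Wv"])
    return x, attn

rng = np.random.default_rng(0)
seq_len, d_model, layers_per_block = 5, 8, 3
x = rng.normal(size=(seq_len, d_model))
block = [make_layer(rng, d_model, with_qk=(i == 0))
         for i in range(layers_per_block)]
out, attn = lazy_block(x, block)
```

In this sketch the query-key product and softmax run once per block rather than once per layer, which is where the saving described in the abstract would come from. A full model would stack several such blocks.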

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2102.12702 [cs.CL]
	(or arXiv:2102.12702v1 [cs.CL] for this version)

Submission history

From: Guolin Ke
[v1] Thu, 25 Feb 2021 06:18:20 GMT (146kb,D)
