Variance-reduced Language Pretraining via a Mask Proposal Network

Chen, Liang



(Submitted on 12 Aug 2020 (v1), last revised 16 Aug 2020 (this version, v2))

Abstract: Self-supervised learning, also known as pretraining, is important in natural language processing. Most pretraining methods first randomly mask some positions in a sentence and then train a model to recover the tokens at the masked positions. In this way, the model can be trained without human labeling, and massive data can be used to train models with billions of parameters. Therefore, optimization efficiency becomes critical. In this paper, we tackle the problem from the view of gradient variance reduction. In particular, we first propose a principled gradient variance decomposition theorem, which shows that the variance of the stochastic gradient in language pretraining can be naturally decomposed into two terms: the variance that arises from the sampling of data in a batch, and the variance that arises from the sampling of the mask. The second term is the key difference between self-supervised learning and supervised learning, and it makes pretraining slower. To reduce the variance of the second term, we leverage importance sampling, which samples the masks according to a proposal distribution instead of the uniform distribution. It can be shown that if the proposal distribution is proportional to the gradient norm, the variance of the sampling is reduced. To improve efficiency, we introduce a MAsk Proposal Network (MAPNet), which approximates the optimal mask proposal distribution and is trained end-to-end along with the model. According to the experimental results, our model converges much faster and achieves higher performance than the baseline BERT model.
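
The importance sampling claim in the abstract can be illustrated with a toy sketch. This is not the paper's MAPNet: it assumes the per-position gradients are known exactly (hypothetical random vectors here), it selects a single masked position rather than a full mask, and the names `estimator_variance` and `norm_proportional_proposal` are made up for illustration. Under those assumptions, it compares the exact variance of a one-sample gradient estimator under a uniform proposal with the variance under a proposal proportional to the gradient norm.

```python
import math
import random


def estimator_variance(grads, proposal):
    """Exact total variance (trace of the covariance) of the one-sample
    estimator g_i / (n * q_i), where position i is drawn with probability q_i.
    The estimator is unbiased for the mean gradient when every q_i > 0."""
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[k] for g in grads) / n for k in range(dim)]
    # E||estimate||^2 = sum_i q_i * ||g_i||^2 / (n * q_i)^2
    second_moment = sum(
        sum(x * x for x in g) / (n * n * q) for g, q in zip(grads, proposal)
    )
    return second_moment - sum(m * m for m in mean)


def norm_proportional_proposal(grads):
    """Proposal with q_i proportional to the gradient norm ||g_i||."""
    norms = [math.sqrt(sum(x * x for x in g)) for g in grads]
    total = sum(norms)
    return [v / total for v in norms]


# Hypothetical per-position gradients: a few positions have much larger scale.
rng = random.Random(0)
n_positions, dim = 16, 4
grads = [
    [rng.gauss(0.0, 5.0 if i % 4 == 0 else 0.5) for _ in range(dim)]
    for i in range(n_positions)
]

uniform = [1.0 / n_positions] * n_positions
var_uniform = estimator_variance(grads, uniform)
var_importance = estimator_variance(grads, norm_proportional_proposal(grads))

print(f"uniform proposal variance:         {var_uniform:.4f}")
print(f"norm-proportional proposal variance: {var_importance:.4f}")
```

In this toy setting the norm-proportional proposal never has higher variance than the uniform one, which follows from the Cauchy-Schwarz inequality. In practice the gradient norms are not available before the backward pass, which is the gap the abstract says MAPNet is meant to fill by approximating the proposal distribution.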

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2008.05333 [cs.CL]
	(or arXiv:2008.05333v2 [cs.CL] for this version)

Submission history

From: Liang Chen [view email]
[v1] Wed, 12 Aug 2020 14:12:32 GMT (737kb,D)
[v2] Sun, 16 Aug 2020 15:40:33 GMT (728kb,D)
