Boosting Distributed Training Performance of the Unpadded BERT Model

Zeng, Jinle; Li, Min; Wu, Zhihua; Liu, Jiaqi; Liu, Yuang; Yu, Dianhai; Ma, Yanjun

(Submitted on 17 Aug 2022)

Abstract: Pre-trained models are an important tool in Natural Language Processing (NLP), and BERT is a classic pre-trained model whose structure has been widely adopted by later work. It was even chosen as the reference model for the MLPerf training benchmark. Optimizing the distributed training performance of BERT models plays an important role in accelerating solutions to most NLP tasks. The BERT model often takes padded tensors as input, which leads to excessive redundant computation. Removing this redundant computation is therefore essential to improving distributed training performance.
This paper designs a new approach to efficiently train BERT models with variable-length inputs. First, we propose a general structure for variable-length BERT models and accelerate the encoder layer with our grouped multi-stream FMHA (Fused Multi-Head Attention) method. Second, we address the unbalanced workload caused by variable-length inputs through data exchange, which overlaps highly with the training process. Finally, we optimize the overall performance of the BERT model with techniques such as kernel fusion and operator optimization. Our experimental results show that our highly optimized BERT model achieves state-of-the-art throughput and ranks first in MLPerf Training v2.0 within the same GPU configuration. The optimizations in this paper can be applied to more BERT-like models in our future work.
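
To make the two central ideas of the abstract concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a right-padded batch of token ids with a hypothetical padding id of 0 that never occurs as a real token. The helper names (`unpad`, `padding_ratio`, `balance_by_length`) are invented for this example. The greedy longest-first assignment is a simple stand-in for workload balancing, not the data-exchange scheme the paper describes.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed padding token id (hypothetical; must not be a real token)


def unpad(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Flatten a padded batch into its real tokens plus cumulative offsets.

    Sequence i occupies tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    tokens: List[int] = []
    cu_seqlens: List[int] = [0]
    for row in batch:
        real = [t for t in row if t != pad_id]  # drop padding positions
        tokens.extend(real)
        cu_seqlens.append(cu_seqlens[-1] + len(real))
    return tokens, cu_seqlens


def padding_ratio(batch: List[List[int]], pad_id: int = PAD_ID) -> float:
    """Fraction of positions in the padded batch that hold only padding."""
    total = sum(len(row) for row in batch)
    tokens, _ = unpad(batch, pad_id)
    return 1.0 - len(tokens) / total


def balance_by_length(lengths: List[int], num_workers: int) -> List[List[int]]:
    """Greedily assign sequence indices to workers, longest sequence first.

    Each sequence goes to the worker with the smallest token count so far.
    """
    loads = [0] * num_workers
    assignment: List[List[int]] = [[] for _ in range(num_workers)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        worker = loads.index(min(loads))
        assignment[worker].append(idx)
        loads[worker] += lengths[idx]
    return assignment


if __name__ == "__main__":
    batch = [
        [5, 6, 7, 0, 0, 0],
        [8, 9, 0, 0, 0, 0],
        [1, 2, 3, 4, 5, 6],
    ]
    tokens, cu_seqlens = unpad(batch)
    print(tokens, cu_seqlens)
    print(round(padding_ratio(batch), 3))
    print(balance_by_length([3, 2, 6, 1], num_workers=2))
```

In this toy batch, 7 of the 18 positions are padding, so an unpadded model would skip roughly 39% of the token-level work. The same variation in sequence length is what makes per-worker token counts uneven and motivates balancing.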

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2208.08124 [cs.DC]
	(or arXiv:2208.08124v1 [cs.DC] for this version)

Submission history

From: Jinle Zeng
[v1] Wed, 17 Aug 2022 07:40:20 GMT (1175kb,D)
