UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Luo, Huaishao; Ji, Lei; Shi, Botian; Huang, Haoyang; Duan, Nan; Li, Tianrui; Li, Jason; Bharti, Taroon; Zhou, Ming

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 15 Feb 2020 (v1), last revised 15 Sep 2020 (this version, v3))

Abstract: With the recent success of pre-training techniques for NLP and image-linguistic tasks, video-linguistic pre-training methods have gradually been developed to improve video-text downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, which leads to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL, a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components with a Transformer backbone: two single-modal encoders, a cross encoder, and a decoder. Five objectives are designed to train these components: video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective. Pre-training is carried out on HowTo100M, a large instructional video dataset. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2002.06353 [cs.CV]
	(or arXiv:2002.06353v3 [cs.CV] for this version)

Submission history

From: Huaishao Luo
[v1] Sat, 15 Feb 2020 10:03:25 GMT (4874kb,D)
[v2] Sat, 1 Aug 2020 14:21:43 GMT (6584kb,D)
[v3] Tue, 15 Sep 2020 13:27:13 GMT (8825kb,D)
