End-to-end Generative Pretraining for Multimodal Video Captioning

Seo, Paul Hongsuck; Nagrani, Arsha; Arnab, Anurag; Schmid, Cordelia

(Submitted on 20 Jan 2022 (v1), last revised 10 May 2022 (this version, v2))

Abstract: Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
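
The bidirectional generation objective described in the abstract can be illustrated with a minimal sketch of how training pairs might be assembled from an unlabelled video. This is not the authors' code. The `Segment` type, the `bidirectional_pairs` function, and the string placeholders for frames are all hypothetical, and treating "future observations" as the next segment's frames plus its utterance is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One clip of an unlabelled video (hypothetical container)."""
    frames: str     # placeholder standing in for the clip's raw pixels
    utterance: str  # transcribed speech aligned with the clip


# (direction, (input frames, input utterance), target utterance)
Example = Tuple[str, Tuple[str, str], str]


def bidirectional_pairs(segments: List[Segment]) -> List[Example]:
    """Build forward and backward generation examples from consecutive segments."""
    examples: List[Example] = []
    for present, future in zip(segments, segments[1:]):
        # Forward: present multimodal context -> future utterance.
        examples.append(
            ("forward", (present.frames, present.utterance), future.utterance)
        )
        # Backward: future observations -> present utterance.
        # Assumption: "future observations" = future frames + future utterance.
        examples.append(
            ("backward", (future.frames, future.utterance), present.utterance)
        )
    return examples


video = [
    Segment("clip0", "crack the eggs"),
    Segment("clip1", "whisk them well"),
    Segment("clip2", "pour into the pan"),
]
pairs = bidirectional_pairs(video)
for direction, inputs, target in pairs:
    print(direction, inputs, "->", target)
```

Under this reading, every pair of consecutive segments yields two examples, so no manually written caption is needed: the utterances themselves serve as generation targets in both directions.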

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Journal reference:	Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR) 2022
Cite as:	arXiv:2201.08264 [cs.CV]
	(or arXiv:2201.08264v2 [cs.CV] for this version)

Submission history

From: Paul Hongsuck Seo
[v1] Thu, 20 Jan 2022 16:16:21 GMT (2196kb,D)
[v2] Tue, 10 May 2022 09:36:22 GMT (7456kb,D)
