Multimodal Pretraining for Dense Video Captioning

Huang, Gabriel; Pang, Bo; Zhu, Zhenhai; Rivera, Clara; Soricut, Radu

(Submitted on 10 Nov 2020)

Abstract: Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.

Comments:	AACL-IJCNLP 2020
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2011.11760 [cs.CV]
	(or arXiv:2011.11760v1 [cs.CV] for this version)

Submission history

From: Gabriel Huang
[v1] Tue, 10 Nov 2020 21:49:14 GMT (2420kb,D)
