Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Yang, Antoine; Miech, Antoine; Sivic, Josef; Laptev, Ivan; Schmid, Cordelia

(Submitted on 16 Jun 2022 (v1), last revised 10 Oct 2022 (this version, v2))

Abstract: Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. Specifically, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train these modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at this https URL
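
Step (iii) of the abstract, answering by filling a masked slot, can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the function names, and the logit values below are all assumptions made for illustration, and `toy_mask_logits` is a hypothetical stand-in for a frozen bidirectional language model conditioned on video features.

```python
import math
from typing import Dict, List

MASK = "[MASK]"


def build_prompt(question: str) -> str:
    # Hypothetical template: the answer slot is a single mask token.
    return f"Question: {question} Answer: {MASK}."


def toy_mask_logits(prompt: str) -> Dict[str, float]:
    # Stand-in for a frozen bidirectional LM (assumption, not a real model).
    # A real system would return one logit per vocabulary token at the
    # masked position; these values are made up for illustration.
    assert MASK in prompt
    return {"guitar": 2.0, "piano": 0.5, "drums": -1.0, "the": 3.0}


def answer_probabilities(logits: Dict[str, float],
                         candidates: List[str]) -> Dict[str, float]:
    # Softmax restricted to the candidate answers, so tokens that are
    # not valid answers (here "the") cannot be selected.
    scores = [logits[c] for c in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(candidates, exps)}


def predict(question: str, candidates: List[str]) -> str:
    # Build the masked prompt, score the mask, return the best candidate.
    prompt = build_prompt(question)
    probs = answer_probabilities(toy_mask_logits(prompt), candidates)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    print(predict("What instrument is the man playing?",
                  ["guitar", "piano", "drums"]))
```

Restricting the softmax to a candidate answer list is one plausible design choice for open-ended VideoQA; the sketch omits the visual inputs and the trainable modules of steps (i) and (ii) entirely.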

Comments:	NeurIPS 2022 Camera-Ready; Project Webpage: this https URL; 25 pages; 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2206.08155 [cs.CV]
	(or arXiv:2206.08155v2 [cs.CV] for this version)

Submission history

From: Antoine Yang
[v1] Thu, 16 Jun 2022 13:18:20 GMT (3341kb,D)
[v2] Mon, 10 Oct 2022 15:08:43 GMT (3295kb,D)
