Video Question Answering with Phrases via Semantic Roles

Sadhu, Arka; Chen, Kan; Nevatia, Ram

(Submitted on 8 Apr 2021)

Abstract: Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or to selecting a phrase from a fixed set of phrases. These metrics limit the application scenarios of VidQA models. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases and introduce VidQAP, which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer compared to an empty string. To reduce the influence of language bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We further perform extensive analysis and ablation studies to guide future work.
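
The relative-improvement idea in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so everything below is an assumption made for illustration: the "<Q>" slot marker, the token-level F1 used as the base metric, the normalization by the score of the reference against itself, and the function names token_f1, fill, and relative_score. It is not the authors' implementation.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between two strings (an assumed stand-in metric)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def fill(template: str, phrase: str, slot: str = "<Q>") -> str:
    """Put a phrase into the masked slot and normalize whitespace."""
    return " ".join(template.replace(slot, phrase).split())


def relative_score(template: str, predicted: str, reference: str,
                   metric=token_f1) -> float:
    """Improvement of the predicted phrase over an empty answer,
    scaled so that the reference phrase itself scores 1 (assumed scaling)."""
    target = fill(template, reference)
    baseline = metric(fill(template, ""), target)   # empty-string answer
    achieved = metric(fill(template, predicted), target)
    ceiling = metric(target, target)                # best attainable score
    if ceiling == baseline:
        return 0.0
    return (achieved - baseline) / (ceiling - baseline)


if __name__ == "__main__":
    template = "a person <Q> in the kitchen"
    reference = "washes the dishes"
    for answer in ("washes the dishes", "", "washes a plate"):
        print(repr(answer), relative_score(template, answer, reference))
```

Under these assumptions, an answer identical to the reference scores 1, an empty answer scores 0, and an answer that overlaps less with the reference sentence than the unfilled template does can score below 0.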

Comments: NAACL21 Camera Ready including appendix
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:2104.03762 [cs.CV] (or arXiv:2104.03762v1 [cs.CV] for this version)

Submission history

From: Arka Sadhu
[v1] Thu, 8 Apr 2021 13:27:43 GMT (2601kb,D)
