VLG-Net: Video-Language Graph Matching Network for Video Grounding

Qu, Sisi; Soldan, Mattia; Xu, Mengmeng; Tegner, Jesper; Ghanem, Bernard

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 19 Nov 2020 (this version), latest version 16 Aug 2021 (v2))

Abstract: Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands the understanding of videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the domains, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs, built on top of video snippets and query tokens separately, which are used for modeling the intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling by fusing the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with natural language queries: ActivityNet-Captions, TACoS, and DiDeMo.
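As a rough illustration of the "masked moment attention pooling" step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes each candidate moment is an inclusive range of snippet indices, that attention scores come from a dot product with a single scoring vector `w`, and that snippets outside the moment receive zero weight. The function name, the scoring vector, and the data layout are all hypothetical choices made for this example.

```python
import math

def masked_moment_attention_pooling(snippets, moments, w):
    """Pool snippet features into one feature vector per candidate moment.

    snippets: list of T feature vectors (each a list of D floats)
    moments:  list of (start, end) snippet indices, inclusive (assumed layout)
    w:        hypothetical scoring vector of length D
    """
    num_snippets = len(snippets)
    dim = len(snippets[0])
    # One scalar attention score per snippet.
    scores = [sum(x * wi for x, wi in zip(row, w)) for row in snippets]

    pooled = []
    for start, end in moments:
        # Mask: only snippets inside the moment may receive attention.
        mask = [start <= t <= end for t in range(num_snippets)]
        top = max(s for s, m in zip(scores, mask) if m)
        # Softmax restricted to unmasked snippets (shifted for stability).
        exp_scores = [math.exp(s - top) if m else 0.0
                      for s, m in zip(scores, mask)]
        total = sum(exp_scores)
        weights = [e / total for e in exp_scores]
        # Attention-weighted sum of snippet features.
        pooled.append([sum(weights[t] * snippets[t][d]
                           for t in range(num_snippets))
                       for d in range(dim)])
    return pooled
```

With a zero scoring vector the attention is uniform inside each moment, so the pooled feature reduces to the mean of the moment's snippet features; a learned `w` would instead emphasize some snippets over others.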

Comments:	12 pages, 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2011.10132 [cs.CV]
	(or arXiv:2011.10132v1 [cs.CV] for this version)

Submission history

From: Mattia Soldan
[v1] Thu, 19 Nov 2020 22:32:03 GMT (16459kb,D)
[v2] Mon, 16 Aug 2021 14:53:59 GMT (13963kb,D)
