Computer Science > Computer Vision and Pattern Recognition
Title: VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks
(Submitted on 16 Apr 2021 (v1), last revised 12 Jun 2022 (this version, v2))
Abstract: Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, very little work has studied NMN in video-grounded dialogue tasks. These tasks extend the complexity of traditional visual tasks with additional visual temporal variance and cross-turn language dependencies. Motivated by recent NMN approaches to image-grounded tasks, we introduce the Video-grounded Neural Module Network (VGNMN), which models the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in dialogues to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN can achieve promising performance on a challenging video-grounded dialogue benchmark as well as a video QA benchmark.
Submission history
From: Hung Le
[v1] Fri, 16 Apr 2021 06:47:41 GMT (1532kb,D)
[v2] Sun, 12 Jun 2022 14:13:09 GMT (2105kb,D)