From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

Song, Jingkuan; Zeng, Pengpeng; Gao, Lianli; Shen, Heng Tao

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 4 Jun 2022)

Abstract: Recently, attention-based Visual Question Answering (VQA) has achieved great success by using the question to selectively target the visual areas that are related to the answer. Existing visual attention models are generally planar, i.e., different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the attention mechanism because CNN features are naturally both spatial and channel-wise. Also, visual attention is usually computed at the pixel level, which may cause region discontinuity problems. In this paper, we propose a Cubic Visual Attention (CVA) model that applies a novel channel and spatial attention to object regions to improve the VQA task. Specifically, instead of attending to pixels, we first take advantage of object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we use the question to guide the channel attention and spatial attention calculation based on the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of our proposed CVA on three public image QA datasets: COCO-QA, VQA and Visual7W. Experimental results show that our proposed method significantly outperforms the state of the art.
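The attention pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the tanh/softmax scoring, the mean pooling over objects, the parameter names (`W_vc`, `W_qc`, `W_c`, `W_vs`, `W_qs`, `w_s`), and all dimensions are assumptions chosen for illustration, and the weights are random rather than learned. The object features and question embedding, which would come from an object proposal network and a question encoder, are stood in for by random arrays.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cubic_attention(obj_feats, q, p):
    """Question-guided channel attention followed by object-level attention.

    obj_feats: (K, C) array, one conv feature vector per object candidate.
    q:         (D,) question embedding.
    p:         dict of weight matrices (hypothetical names and shapes).
    """
    # Channel attention: one weight per channel, conditioned on the question.
    pooled = obj_feats.mean(axis=0)                          # (C,)
    h_c = np.tanh(p["W_vc"] @ pooled + p["W_qc"] @ q)        # (H,)
    beta = softmax(p["W_c"] @ h_c)                           # (C,)
    reweighted = obj_feats * beta                            # (K, C)

    # Spatial attention over object regions: one weight per object.
    h_s = np.tanh(reweighted @ p["W_vs"].T + p["W_qs"] @ q)  # (K, H)
    alpha = softmax(h_s @ p["w_s"])                          # (K,)

    # Attended visual feature: weighted sum over objects.
    attended = alpha @ reweighted                            # (C,)
    return attended, beta, alpha

# Toy sizes (assumed): K objects, C channels, D question dims, H hidden units.
K, C, D, H = 5, 8, 6, 4
rng = np.random.default_rng(0)
params = {
    "W_vc": rng.normal(size=(H, C)), "W_qc": rng.normal(size=(H, D)),
    "W_c": rng.normal(size=(C, H)),
    "W_vs": rng.normal(size=(H, C)), "W_qs": rng.normal(size=(H, D)),
    "w_s": rng.normal(size=H),
}
obj_feats = rng.normal(size=(K, C))   # stand-in for object conv features
q = rng.normal(size=D)                # stand-in for a question embedding
attended, beta, alpha = cubic_attention(obj_feats, q, params)
```

Here `beta` weights channels and `alpha` weights object regions, so the two together vary the attention across both axes of the feature tensor instead of sharing one weight across channels. The final step in the abstract, combining the attended feature with the question to infer the answer, is left out of the sketch.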

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.01923 [cs.CV]
	(or arXiv:2206.01923v1 [cs.CV] for this version)

Submission history

From: Pengpeng Zeng
[v1] Sat, 4 Jun 2022 07:03:18 GMT (243kb,D)
