SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Xiong, Peixi; You, Quanzeng; Yu, Pei; Liu, Zicheng; Wu, Ying

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 25 Jan 2022)

Abstract: Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging because it requires not only visual and textual understanding but also the ability to align cross-modality representations. Previous approaches rely heavily on entity-level alignments, such as the correlations between visual regions and their semantic labels, or the interactions between question words and object features. These attempts aim to improve the cross-modality representations while ignoring their internal relations. Instead, we propose to apply structured alignments, which work with graph representations of visual and textual content and aim to capture the deep connections between the visual and textual modalities. Nevertheless, it is nontrivial to represent and integrate graphs for structured alignments. In this work, we attempt to solve this issue by first converting the entities of each modality into sequential nodes and an adjacency graph, then incorporating them for structured alignments. As demonstrated in our experimental results, such structured alignment improves reasoning performance. In addition, our model exhibits better interpretability for each generated answer. The proposed model, without any pretraining, outperforms the state-of-the-art methods on the GQA dataset and beats the non-pretrained state-of-the-art methods on the VQA-v2 dataset.
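
The abstract describes converting entities into sequential nodes plus an adjacency graph without giving details, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the adjacency graph is used to restrict self-attention to graph neighbours; the function names `entities_to_graph` and `masked_attention`, the example entities, and the feature vectors are all hypothetical.

```python
import math


def entities_to_graph(entities, relations):
    """Turn entities and (subject, object) pairs into a node sequence
    and a symmetric adjacency matrix with self-loops (an assumption)."""
    nodes = list(entities)
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    # Self-loops guarantee that every node can attend to itself.
    adj = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for subj, obj in relations:
        i, j = index[subj], index[obj]
        adj[i][j] = adj[j][i] = 1
    return nodes, adj


def masked_attention(features, adj):
    """Scaled dot-product self-attention where node i only attends to
    nodes j with adj[i][j] == 1. No learned projections are used."""
    d = len(features[0])
    out = []
    for i, q in enumerate(features):
        # Non-neighbours get a score of -inf, hence a weight of zero.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if adj[i][j] else -math.inf
            for j, k in enumerate(features)
        ]
        m = max(scores)  # finite because of the self-loop
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(w * v[c] for w, v in zip(weights, features)) for c in range(d)]
        )
    return out


# Hypothetical scene: "man" is related to "horse" and to "hat".
nodes, adj = entities_to_graph(
    ["man", "horse", "hat"], [("man", "horse"), ("man", "hat")]
)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = masked_attention(features, adj)
```

In this sketch, "horse" and "hat" never exchange information directly because no edge connects them, which is the sense in which the graph structure, rather than entity-level similarity alone, shapes the resulting representations.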

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2201.10654 [cs.CV]
	(or arXiv:2201.10654v1 [cs.CV] for this version)

Submission history

From: Peixi Xiong
[v1] Tue, 25 Jan 2022 22:26:09 GMT (2967kb)
