Towards Language-guided Visual Recognition via Dynamic Convolutions

Luo, Gen; Zhou, Yiyi; Sun, Xiaoshuai; Wu, Yongjian; Gao, Yue; Ji, Rongrong

(Submitted on 17 Oct 2021 (v1), last revised 14 Sep 2023 (this version, v2))

Abstract: In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To this end, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets of two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental results not only show the performance gains of LaConv over existing multi-modal modules, but also demonstrate the merits of LaConvNet as a unified network, including a compact architecture, high generalization ability and excellent performance, e.g., +4.7% on RefCOCO+.
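
The general idea of generating convolution kernels from language can be illustrated with a minimal sketch. This is not the LaConv module as defined in the paper, whose exact formulation the abstract does not give; it is a generic language-conditioned dynamic convolution under simplifying assumptions: a single-channel image, one k x k kernel per text input, and a plain linear projection as the kernel generator. The names `generate_kernel`, `conv2d_same`, and `language_conditioned_conv`, as well as the toy embeddings, are hypothetical.

```python
import random

def generate_kernel(text_vec, weight, k=3):
    """Project a text embedding to a k x k kernel.

    Assumption: the generator is a single linear layer with k*k rows;
    the paper's actual kernel generator may differ.
    """
    flat = [sum(w * t for w, t in zip(row, text_vec)) for row in weight]
    return [flat[i * k:(i + 1) * k] for i in range(k)]

def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * image[yy][xx]
            out[y][x] = acc
    return out

def language_conditioned_conv(image, text_vec, weight, k=3):
    """Filter the image with a kernel that depends on the text embedding."""
    kernel = generate_kernel(text_vec, weight, k)
    return conv2d_same(image, kernel)

# Toy data: a 5x5 image and two different 4-dimensional "text embeddings".
rng = random.Random(0)
d, k = 4, 3
weight = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k * k)]
image = [[float(x + y) for x in range(5)] for y in range(5)]
text_a = [1.0, 0.0, 0.0, 0.0]
text_b = [0.0, 1.0, 0.0, 0.0]

# The same image yields different feature maps for different text inputs.
out_a = language_conditioned_conv(image, text_a, weight)
out_b = language_conditioned_conv(image, text_b, weight)
```

In this sketch the visual features differ between `out_a` and `out_b` only because the kernels were produced from different text vectors, which is the property the abstract attributes to dynamically generated kernels.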

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2110.08797 [cs.CV]
	(or arXiv:2110.08797v2 [cs.CV] for this version)

Submission history

From: Gen Luo
[v1] Sun, 17 Oct 2021 11:29:13 GMT (516kb,D)
[v2] Thu, 14 Sep 2023 13:37:38 GMT (3045kb,D)
