VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Qiu, Longtian; Zhang, Renrui; Guo, Ziyu; Zeng, Ziyao; Guo, Zilu; Li, Yafeng; Zhang, Guangnan

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 4 Dec 2021 (v1), last revised 10 Aug 2023 (this version, v3))

Abstract: Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transfer performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide the textual features of different categories to adaptively explore informative regions of the image and aggregate visual features through attention mechanisms. In this way, the texts become visual-guided, that is, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
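
The attention step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single-head scaled dot-product cross-attention with a residual connection, uses hand-written toy vectors in place of CLIP encoder outputs, and the function names (visual_guided_texts, classify) are hypothetical. The paper's actual module may differ, for example in its projections, number of heads, or normalization.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def visual_guided_texts(text_feats, patch_feats):
    """Assumed form: each class text feature (query) attends over
    image patch features (keys and values), then adds the result back."""
    d = len(text_feats[0])
    guided = []
    for t in text_feats:
        scores = [dot(t, p) / math.sqrt(d) for p in patch_feats]
        weights = softmax(scores)
        attended = [sum(w * p[i] for w, p in zip(weights, patch_feats))
                    for i in range(d)]
        # Residual connection (an assumption of this sketch).
        guided.append([ti + ai for ti, ai in zip(t, attended)])
    return guided

def classify(image_feat, text_feats):
    """Category-wise matching by cosine similarity."""
    img = l2_normalize(image_feat)
    sims = [dot(img, l2_normalize(t)) for t in text_feats]
    pred = max(range(len(sims)), key=sims.__getitem__)
    return pred, sims

# Toy data: two class text features, three patch features, one global image feature.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
patch_feats = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.3]]
image_feat = [0.6, 0.2]

guided = visual_guided_texts(text_feats, patch_feats)
pred, sims = classify(image_feat, guided)
print(pred, sims)
```

In this sketch the text features act as queries, so each category produces its own image-conditioned text feature before the usual similarity-based matching.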

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2112.02399 [cs.CV]
	(or arXiv:2112.02399v3 [cs.CV] for this version)

Submission history

From: Longtian Qiu
[v1] Sat, 4 Dec 2021 18:34:24 GMT (1500kb,D)
[v2] Thu, 3 Nov 2022 08:23:13 GMT (923kb,D)
[v3] Thu, 10 Aug 2023 15:31:54 GMT (923kb,D)
