Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Zhang, Le; Awal, Rabiul; Agrawal, Aishwarya

(Submitted on 15 Jun 2023 (v1), last revised 25 Apr 2024 (this version, v4))

Abstract: Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at this https URL
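For readers unfamiliar with the "standard image-text contrastive learning framework" that the abstract refers to, the following is a minimal sketch of a symmetric InfoNCE-style loss in which hard-negative captions can be appended to the image-to-text candidates. It is an illustration only and is not the loss proposed in the paper, whose exact formulation the abstract does not give. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, index):
    """Numerically stable log-softmax evaluated at one position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - lse


def contrastive_loss(image_embs, text_embs, hard_neg_text_embs=None,
                     temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative, hypothetical helper).

    image_embs[i] and text_embs[i] form a matched pair. Optional
    hard_neg_text_embs are extra captions that match no image; they are
    added only as distractors on the image-to-text side.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    hard = [normalize(v) for v in (hard_neg_text_embs or [])]
    n = len(images)

    # Image-to-text: each image must pick its own caption among all
    # captions in the batch plus any hard-negative captions.
    i2t = 0.0
    for i, img in enumerate(images):
        logits = [dot(img, t) / temperature for t in texts + hard]
        i2t -= log_softmax_at(logits, i)

    # Text-to-image: each caption must pick its own image in the batch.
    t2i = 0.0
    for j, txt in enumerate(texts):
        logits = [dot(txt, img) / temperature for img in images]
        t2i -= log_softmax_at(logits, j)

    return (i2t + t2i) / (2 * n)


# Toy 2-D embeddings, invented for illustration.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]
hard_negs = [[0.8, 0.3], [0.3, 0.8]]  # captions close to, but not matching, the images

base = contrastive_loss(images, texts)
with_hard = contrastive_loss(images, texts, hard_negs)
print(base, with_hard)
```

In this sketch the hard negatives only enlarge the set of distractor captions, so the loss with hard negatives is always at least as large as the loss without them. That gives the model a training signal from captions that differ from the true one in fine-grained ways such as relations or attributes.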

Comments:	CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.08832 [cs.CV]
	(or arXiv:2306.08832v4 [cs.CV] for this version)

Submission history

From: Le Zhang
[v1] Thu, 15 Jun 2023 03:26:28 GMT (10447kb,D)
[v2] Sun, 2 Jul 2023 00:31:36 GMT (10449kb,D)
[v3] Thu, 28 Dec 2023 15:44:04 GMT (16623kb,D)
[v4] Thu, 25 Apr 2024 15:24:11 GMT (14531kb,D)
