Title: Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Abstract: Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at this https URL
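
For readers unfamiliar with the "standard image-text contrastive learning framework" that the abstract builds on, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss, with optional hard-negative captions added as extra candidates on the image-to-text side. It is not the paper's objective, which the abstract does not specify; the function names, the `hard_neg_text_embs` parameter, and the temperature value of 0.07 are assumptions made for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def log_softmax_at(logits, index):
    # Numerically stable log-softmax evaluated at one position.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z

def contrastive_loss(image_embs, text_embs, hard_neg_text_embs=None,
                     temperature=0.07):
    """Generic symmetric image-text contrastive loss (illustrative only).

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    hard_neg_text_embs (hypothetical parameter) holds extra captions that
    match no image; they only enlarge the image-to-text candidate set.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    hard = [normalize(v) for v in (hard_neg_text_embs or [])]
    n = len(images)

    # Image-to-text: each image must pick its own caption among all
    # captions in the batch plus any hard-negative captions.
    i2t = 0.0
    for i in range(n):
        logits = [dot(images[i], t) / temperature for t in texts + hard]
        i2t -= log_softmax_at(logits, i)

    # Text-to-image: each caption must pick its own image in the batch.
    t2i = 0.0
    for i in range(n):
        logits = [dot(texts[i], im) / temperature for im in images]
        t2i -= log_softmax_at(logits, i)

    return (i2t + t2i) / (2 * n)

# Toy usage: two perfectly aligned pairs, then one confusable caption.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
base = contrastive_loss(images, texts)
with_hard = contrastive_loss(images, texts, hard_neg_text_embs=[[1.0, 0.0]])
```

In this toy setup the hard negative has the same embedding as the first caption, so the loss rises because the first image can no longer separate its true caption from the negative. A model with "bag-of-words" representations would embed a caption and its reordered variant almost identically, which is the failure mode the abstract describes.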
Comments: CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2306.08832 [cs.CV]
  (or arXiv:2306.08832v4 [cs.CV] for this version)

Submission history

From: Le Zhang
[v1] Thu, 15 Jun 2023 03:26:28 GMT (10447kb,D)
[v2] Sun, 2 Jul 2023 00:31:36 GMT (10449kb,D)
[v3] Thu, 28 Dec 2023 15:44:04 GMT (16623kb,D)
[v4] Thu, 25 Apr 2024 15:24:11 GMT (14531kb,D)