Revisiting the Role of Language Priors in Vision-Language Models

Lin, Zhiqiu; Chen, Xinyue; Pathak, Deepak; Zhang, Pengchuan; Ramanan, Deva

(Submitted on 2 Jun 2023 (v1), revised 5 Oct 2023 (this version, v2), latest version 15 May 2024 (v4))

Abstract: Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores any image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual-question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls for the amount of linguistic bias in generative VLMs at test time without having to retrain or fine-tune the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, oftentimes producing state-of-the-art accuracy.
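The scoring and debiasing idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the formula, so the form used here is an assumption: the match score is the log-likelihood of the caption given the image, log P(t | i), and the debiased score subtracts a scaled log-likelihood under a "blind" language model, alpha * log P(t). The function names, the `alpha` parameter, and the toy per-token log-probabilities are all hypothetical; in practice the log-probabilities would come from a generative VLM and from an image-free estimate of the text prior.

```python
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log-probability of a caption: the sum of its per-token log-probabilities."""
    return sum(token_log_probs)


def debiased_score(
    cond_token_log_probs: Sequence[float],
    prior_token_log_probs: Sequence[float],
    alpha: float,
) -> float:
    """Assumed debiased score: log P(t | i) - alpha * log P(t).

    alpha = 0 leaves the conditional score unchanged; larger alpha
    discounts captions that are likely regardless of the image.
    """
    cond = sequence_log_prob(cond_token_log_probs)
    prior = sequence_log_prob(prior_token_log_probs)
    return cond - alpha * prior


def retrieve_text(
    cond_log_probs: Sequence[Sequence[float]],
    prior_log_probs: Sequence[Sequence[float]],
    alpha: float,
) -> int:
    """Return the index of the candidate caption with the highest score."""
    scores = [
        debiased_score(cond, prior, alpha)
        for cond, prior in zip(cond_log_probs, prior_log_probs)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy numbers (hypothetical, not from the paper), one image and two captions.
# Caption 0: fluent but wrong for the image.
# Caption 1: correct for the image but linguistically unlikely.
COND = [[-1.0, -1.0], [-1.5, -1.5]]   # per-token log P(token | image, prefix)
PRIOR = [[-1.0, -1.0], [-3.0, -3.0]]  # per-token log P(token | prefix), no image

if __name__ == "__main__":
    print(retrieve_text(COND, PRIOR, alpha=0.0))
    print(retrieve_text(COND, PRIOR, alpha=1.0))
```

In this toy setup the undebiased score (alpha = 0) prefers the fluent caption, while alpha = 1 prefers the caption whose likelihood rises most once the image is taken into account. Because the adjustment only rescales scores at test time, no retraining or fine-tuning is involved, which matches the post-processing framing in the abstract.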

Comments:	Website: this https URL Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2306.01879 [cs.CV]
	(or arXiv:2306.01879v2 [cs.CV] for this version)

Submission history

From: Zhiqiu Lin [view email]
[v1] Fri, 2 Jun 2023 19:19:43 GMT (8563kb,D)
[v2] Thu, 5 Oct 2023 04:12:28 GMT (13285kb,D)
[v3] Thu, 1 Feb 2024 18:22:25 GMT (10206kb,D)
[v4] Wed, 15 May 2024 07:15:05 GMT (10206kb,D)
