I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data

Gu, Sophia; Clark, Christopher; Kembhavi, Aniruddha

(Submitted on 17 Nov 2022 (this version), latest version 18 Aug 2023 (v4))

Abstract: Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.
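The core idea in the abstract (train on text embeddings, then substitute image embeddings at test time) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a simulated shared embedding space in which each class is a random unit vector, a hypothetical constant offset `gap` standing in for the systematic difference between modalities, and a simple nearest-centroid classifier in place of a trained task model. The names `embed`, `fit_centroids`, `predict`, and `run_demo`, and all numeric settings, are illustrative only.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length, as contrastive encoders typically do."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def embed(concept, offset, noise_std, rng):
    """Toy stand-in for an encoder: concept + modality offset + Gaussian noise."""
    raw = [c + o + rng.gauss(0.0, noise_std) for c, o in zip(concept, offset)]
    return normalize(raw)


def fit_centroids(embeddings, labels, num_classes):
    """Fit one unit-length centroid per class from the training embeddings."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    for emb, label in zip(embeddings, labels):
        for i, v in enumerate(emb):
            sums[label][i] += v
    return [normalize(s) for s in sums]


def predict(centroids, emb):
    """Return the class whose centroid has the highest cosine similarity."""
    scores = [dot(c, emb) for c in centroids]
    return scores.index(max(scores))


def run_demo(seed=0, num_classes=5, dim=64, n_train=200, n_test=200):
    rng = random.Random(seed)
    # Assumption: each class is a random direction in the shared space.
    concepts = [normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
                for _ in range(num_classes)]
    # Assumption: the modality difference is a constant offset of norm 0.5.
    gap = [0.5 * g for g in normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])]
    no_gap = [0.0] * dim

    train_labels = [i % num_classes for i in range(n_train)]
    test_labels = [i % num_classes for i in range(n_test)]

    # Training uses "text" embeddings only; no "image" vector is seen here.
    text_train = [embed(concepts[k], no_gap, 0.1, rng) for k in train_labels]
    centroids = fit_centroids(text_train, train_labels, num_classes)

    # Evaluation swaps in "image" embeddings, which carry the modality offset.
    image_test = [embed(concepts[k], gap, 0.1, rng) for k in test_labels]
    correct = sum(predict(centroids, e) == k
                  for e, k in zip(image_test, test_labels))
    return correct / n_test
```

Calling `run_demo()` returns the fraction of simulated image embeddings classified correctly by a model fit on text embeddings alone. Increasing the norm of `gap` in this toy setting gives a rough feel for why systematic differences between modalities can hurt transfer.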

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2211.09778 [cs.CV]
	(or arXiv:2211.09778v1 [cs.CV] for this version)

Submission history

From: Sophia Gu [view email]
[v1] Thu, 17 Nov 2022 18:52:19 GMT (5320kb,D)
[v2] Thu, 1 Dec 2022 18:59:05 GMT (5860kb,D)
[v3] Tue, 21 Mar 2023 04:54:55 GMT (8314kb,D)
[v4] Fri, 18 Aug 2023 23:43:42 GMT (8316kb,D)
