Caption supervision enables robust learners

Feuer, Benjamin; Joshi, Ameya; Hegde, Chinmay

(Submitted on 13 Oct 2022 (v1), last revised 8 Dec 2022 (this version, v2))

Abstract: Vision language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision: the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained on a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet, which includes a class-balanced, fully supervised dataset of over 50,000 new human-labeled, ImageNet-compliant samples with web-scraped captions. In a series of experiments on CaptionNet, we show how the choices of loss function, data filtration, and supervision strategy enable robust computer vision. We also provide the codebase necessary to reproduce our experiments at VL Hub.
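
The labeling step mentioned in the abstract (assigning image labels by scanning captions for class names) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_matcher` and `label_from_caption`, the whole-word matching rule, and the choice to return the earliest match are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def build_matcher(class_names: Sequence[str]) -> "re.Pattern[str]":
    """Compile one regex that matches any class name as a whole word.

    Hypothetical helper. Longer names are listed first so that a
    multi-word name such as "tiger shark" is preferred over "shark"
    when both could match at the same position.
    """
    ordered = sorted(class_names, key=len, reverse=True)
    alternation = "|".join(re.escape(name.lower()) for name in ordered)
    return re.compile(rf"\b({alternation})\b")


def label_from_caption(caption: str, class_names: Sequence[str]) -> Optional[int]:
    """Return the index of the first class name found in the caption.

    Returns None when no class name appears, in which case the sample
    would be left unlabeled (an assumption of this sketch).
    """
    matcher = build_matcher(class_names)
    match = matcher.search(caption.lower())
    if match is None:
        return None
    lowered = [name.lower() for name in class_names]
    return lowered.index(match.group(1))


classes = ["shark", "tiger shark", "goldfish"]
print(label_from_caption("A Tiger Shark swimming near a reef", classes))
print(label_from_caption("My goldfish in its new bowl", classes))
print(label_from_caption("Sunset over the harbor", classes))
```

Integer labels produced this way could then be passed to a standard cross-entropy loss. Exact whole-word matching is deliberately strict here, so plurals and synonyms are missed; a real pipeline would need its own rules for those cases.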

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
ACM classes:	I.4.9
Cite as:	arXiv:2210.07396 [cs.CV]
	(or arXiv:2210.07396v2 [cs.CV] for this version)

Submission history

From: Benjamin Feuer
[v1] Thu, 13 Oct 2022 22:29:10 GMT (15283kb,D)
[v2] Thu, 8 Dec 2022 14:28:09 GMT (15391kb,D)
