CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

Vasu, Pavan Kumar Anasosalu; Pouransari, Hadi; Faghri, Fartash; Tuzel, Oncel

Authors: Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel

(Submitted on 14 May 2024)

Abstract: CLIP models perform remarkably well on zero-shot classification and retrieval tasks. However, recent studies have shown that the representations learnt by CLIP are not well suited to dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvements on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks, respectively, compared with recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
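
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how mIoU (for semantic segmentation) and RMSE (for depth estimation) are commonly defined. It is not the authors' evaluation code: the function names `mean_iou` and `rmse`, the flat-list inputs, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
import math


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred and target are flat sequences of integer class labels, one per pixel.
    Classes that appear in neither sequence are skipped (an assumption; some
    benchmarks handle absent classes differently).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def rmse(pred, target):
    """Root-mean-square error between predicted and ground-truth values."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


# Toy inputs, invented purely for illustration.
seg_pred = [0, 0, 1, 1, 2, 2]
seg_true = [0, 1, 1, 1, 2, 0]
print(mean_iou(seg_pred, seg_true, num_classes=3))

depth_pred = [1.0, 2.0, 4.0]
depth_true = [1.0, 3.0, 2.0]
print(rmse(depth_pred, depth_true))
```

Higher mIoU is better and lower RMSE is better, which is why the abstract reports the first as an increase and the second as a decrease.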

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2405.08911 [cs.CV]
	(or arXiv:2405.08911v1 [cs.CV] for this version)

Submission history

From: Pavan Kumar Anasosalu Vasu
[v1] Tue, 14 May 2024 19:06:24 GMT (2132kb,D)
