Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Dou, Zi-Yi; Kamath, Aishwarya; Gan, Zhe; Zhang, Pengchuan; Wang, Jianfeng; Li, Linjie; Liu, Zicheng; Liu, Ce; LeCun, Yann; Peng, Nanyun; Gao, Jianfeng; Wang, Lijuan

(Submitted on 15 Jun 2022 (v1), last revised 18 Nov 2022 (this version, v2))

Abstract: Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using orders of magnitude more data. Code is available at this https URL
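
As a rough illustration of what inserting cross-attention into both backbones could look like, the following is a minimal sketch and not the authors' implementation. It assumes single-head attention without learned projections, token features of equal width in both modalities, and a hypothetical scalar gate `alpha` on the cross-attention output; the names `cross_attention` and `fused_layer` are illustrative and are not taken from the released code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention.

    Each query token attends over all context tokens. No learned
    projections are used (an assumption made to keep the sketch short).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return attended


def fused_layer(image_tokens, text_tokens, alpha):
    """One backbone layer with cross-attention in both directions.

    `alpha` is a hypothetical gate on the cross-modal signal:
    alpha = 0.0 leaves both token sequences unchanged (uni-modal path).
    """
    image_ctx = cross_attention(image_tokens, text_tokens)
    text_ctx = cross_attention(text_tokens, image_tokens)
    new_image = [
        [x + alpha * c for x, c in zip(tok, ctx)]
        for tok, ctx in zip(image_tokens, image_ctx)
    ]
    new_text = [
        [x + alpha * c for x, c in zip(tok, ctx)]
        for tok, ctx in zip(text_tokens, text_ctx)
    ]
    return new_image, new_text


if __name__ == "__main__":
    # Toy inputs: three image tokens and two text tokens, width 2.
    image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [[0.5, 0.5], [2.0, -1.0]]
    fused_image, fused_text = fused_layer(image, text, alpha=0.5)
    print(fused_image)
    print(fused_text)
```

With `alpha = 0.0` the layer returns its inputs unchanged, so the sketch reduces to two independent uni-modal paths; a nonzero gate mixes in information from the other modality at that layer rather than in a separate fusion stack.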

Comments:	NeurIPS 2022. Project Website: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2206.07643 [cs.CV]
	(or arXiv:2206.07643v2 [cs.CV] for this version)

Submission history

From: Zi-Yi Dou
[v1] Wed, 15 Jun 2022 16:41:29 GMT (18373kb,D)
[v2] Fri, 18 Nov 2022 18:23:08 GMT (18376kb,D)
