Computer Science > Computer Vision and Pattern Recognition
Title: Write and Paint: Generative Vision-Language Models are Unified Modal Learners
(Submitted on 15 Jun 2022 (v1), last revised 17 Feb 2023 (this version, v3))
Abstract: Recent advances in vision-language pre-training have pushed the state of the art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we reveal the potential of symmetric generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining vision/language generative pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives at different scales of pre-training data with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at this https URL
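The prefix modeling objective named in the abstract can be illustrated with a minimal sketch. This is not the DaVinci implementation: the abstract does not say how prefixes are chosen or how images are tokenized, so the random split point, the helper names `split_prefix` and `suffix_nll`, and the placeholder probabilities are all assumptions made for illustration. The sketch shows only the generic idea of conditioning on a prefix and scoring the remaining tokens; the same split could in principle be applied to a sequence of discrete image tokens.

```python
import math
import random

def split_prefix(tokens, rng):
    """Split a token sequence at a random point into (prefix, suffix).

    Hypothetical helper: the random cut is an assumption, not the paper's rule.
    """
    # Keep at least one token on each side of the cut.
    cut = rng.randint(1, len(tokens) - 1)
    return tokens[:cut], tokens[cut:]

def suffix_nll(suffix_probs):
    """Mean negative log-likelihood computed over the suffix tokens only."""
    return -sum(math.log(p) for p in suffix_probs) / len(suffix_probs)

rng = random.Random(0)
caption = ["a", "dog", "runs", "on", "the", "beach"]
prefix, suffix = split_prefix(caption, rng)

# Placeholder probabilities standing in for what a model might assign
# to each true suffix token given the prefix (and, for captions, the image).
probs = [0.5] * len(suffix)
loss = suffix_nll(probs)
```

Only the suffix contributes to the loss here; the prefix serves purely as conditioning context.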
Submission history
From: Shizhe Diao
[v1] Wed, 15 Jun 2022 17:49:38 GMT (1147kb,D)
[v2] Thu, 16 Feb 2023 17:01:44 GMT (1292kb,D)
[v3] Fri, 17 Feb 2023 02:58:03 GMT (1292kb,D)