Computer Science > Computer Vision and Pattern Recognition
Title: GLIPv2: Unifying Localization and Vision-Language Understanding
(Submitted on 12 Jun 2022 (v1), last revised 11 Oct 2022 (this version, v2))
Abstract: We present GLIPv2, a grounded vision-language (VL) understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and VL understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel contrastive task at the region-word level, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at this https URL
Submission history
From: Haotian Zhang
[v1] Sun, 12 Jun 2022 20:31:28 GMT (47405kb,D)
[v2] Tue, 11 Oct 2022 23:27:03 GMT (47559kb,D)