Computer Science > Computer Vision and Pattern Recognition
Title: Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
(Submitted on 23 Nov 2022)
Abstract: Recent diffusion-based generative models, combined with vision-language models, can create realistic images from natural language prompts. Although these models are trained on large internet-scale datasets, they are never directly exposed to any semantic localization or grounding signal. Most current approaches to localization or grounding rely on human-annotated localization information in the form of bounding boxes or segmentation masks. The exceptions are a few unsupervised methods that use architectures or loss functions geared towards localization, but these must be trained separately. In this work, we explore how off-the-shelf diffusion models, trained with no exposure to such localization information, can ground various semantic phrases with no segmentation-specific re-training. We introduce an inference-time optimization process that generates segmentation masks conditioned on natural language. We evaluate our proposal, Peekaboo, on unsupervised semantic segmentation using the Pascal VOC dataset and on referring segmentation using the RefCOCO dataset. In summary, we present a first zero-shot, open-vocabulary, unsupervised (no localization information) semantic grounding technique that leverages diffusion-based generative models with no re-training. Our code will be released publicly.
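The general shape of an inference-time mask optimization can be illustrated with a toy sketch. This is not the Peekaboo objective, which the abstract does not specify: the `relevance` grid is a hypothetical stand-in for whatever per-pixel signal a frozen, text-conditioned model would provide, and the loss, the `sparsity` penalty, and the function names are assumptions made for illustration only. The point it shows is that the only quantity being updated is the mask, while the scoring signal stays fixed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def optimize_mask(relevance, steps=200, lr=1.0, sparsity=0.1):
    """Fit per-pixel mask logits by gradient descent.

    `relevance` is a fixed grid standing in for a frozen model's signal
    (hypothetical; it is never updated). Only the mask logits change.
    """
    h, w = len(relevance), len(relevance[0])
    logits = [[0.0] * w for _ in range(h)]
    for _ in range(steps):
        for i in range(h):
            for j in range(w):
                a = sigmoid(logits[i][j])
                # Toy per-pixel loss (an assumption, not the paper's):
                #   loss = -a * relevance + sparsity * a
                # d(loss)/d(logit) = (sparsity - relevance) * a * (1 - a)
                grad = (sparsity - relevance[i][j]) * a * (1.0 - a)
                logits[i][j] -= lr * grad
    # Threshold the soft mask to obtain a binary segmentation.
    return [[sigmoid(v) > 0.5 for v in row] for row in logits]


# Made-up 4x4 relevance grid: high in the centre, low at the border.
relevance = [
    [-0.5, -0.5, -0.5, -0.5],
    [-0.5, 0.9, 0.8, -0.5],
    [-0.5, 0.7, 0.9, -0.5],
    [-0.5, -0.5, -0.5, -0.5],
]
mask = optimize_mask(relevance)
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

In this toy setting, pixels whose relevance exceeds the sparsity penalty are pushed into the mask and the rest are pushed out; a real system would replace the fixed grid with a loss computed through the frozen generative model at every step.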
Submission history
From: Kanchana Ranasinghe
[v1] Wed, 23 Nov 2022 18:59:05 GMT (13039kb,D)