Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Fang, Yuxin; Yang, Shusheng; Wang, Shijie; Ge, Yixiao; Shan, Ying; Wang, Xinggang



(Submitted on 6 Apr 2022 (v1), last revised 19 May 2022 (this version, v2))

Abstract: We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, based on two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% $\sim$ 50% of the input embeddings. (ii) To construct multi-scale representations for object detection from a single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained large-kernel patchify stem, and its intermediate features can naturally serve as the higher-resolution inputs of a feature pyramid network without further upsampling or other manipulations. The pre-trained ViT is regarded only as the 3$^{rd}$ stage of our detector's backbone rather than as the whole feature extractor, which results in a ConvNet-ViT hybrid feature extractor. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results than the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8$\times$ faster. Code and pre-trained models are available at this https URL
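The two observations in the abstract can be illustrated with a minimal, standard-library sketch. It is not the MIMDet implementation: the function names, the 224x224 input, the 16x16 patch size, and the choice of four stride-2 layers in the stem are all assumptions made here for illustration. The first function draws a uniformly random subset of patch-token indices (the "partial observations" passed to the encoder); the second lists the cumulative downsampling factor after each stride-2 layer of a hypothetical convolutional stem, whose intermediate outputs would be the higher-resolution candidates for a feature pyramid.

```python
import random


def sample_visible_tokens(num_tokens, keep_ratio, seed=None):
    """Return sorted indices of a uniformly random subset of patch tokens.

    keep_ratio is the fraction of input embeddings that stays visible
    (the abstract mentions values between 0.25 and 0.5).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)
    num_keep = max(1, round(num_tokens * keep_ratio))
    # Sample without replacement so each token appears at most once.
    return sorted(rng.sample(range(num_tokens), num_keep))


def stem_strides(num_stride2_layers):
    """Cumulative downsampling factor after each stride-2 layer of a stem."""
    return [2 ** (i + 1) for i in range(num_stride2_layers)]


# Assumed setting: a 224x224 image split into 16x16 patches -> 14 * 14 tokens.
grid = 224 // 16
visible = sample_visible_tokens(grid * grid, keep_ratio=0.25, seed=0)
print(len(visible))      # 49 of the 196 tokens remain visible
print(stem_strides(4))   # [2, 4, 8, 16]
```

Under these assumptions, the stride-4 and stride-8 outputs of the stem are available at higher resolution than the stride-16 token grid the ViT operates on, which is the sense in which no extra upsampling is needed to build a pyramid.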

Comments:	v2: more analysis & stronger results. Preprint. Work in progress. Code and pre-trained models are available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.02964 [cs.CV]
	(or arXiv:2204.02964v2 [cs.CV] for this version)

Submission history

From: Yuxin Fang
[v1] Wed, 6 Apr 2022 17:59:04 GMT (895kb,D)
[v2] Thu, 19 May 2022 03:41:11 GMT (2047kb,D)
