MaskViT: Masked Visual Pre-Training for Video Prediction

Gupta, Agrim; Tian, Stephen; Zhang, Yunzhi; Wu, Jiajun; Martín-Martín, Roberto; Fei-Fei, Li

(Submitted on 23 Jun 2022 (v1), last revised 6 Aug 2022 (this version, v2))

Abstract: The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
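The two design decisions named in the abstract, a variable masking ratio during training and a masking ratio that decreases over refinement steps at inference, can be illustrated with a small sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract does not state which scheduling function is used, so a cosine schedule is swapped in for illustration, and the function names and the training bounds `low` and `high` are hypothetical.

```python
import math
import random
from typing import List


def sample_training_mask_ratio(rng: random.Random,
                               low: float = 0.5,
                               high: float = 1.0) -> float:
    """Draw a masking ratio for one training batch.

    The bounds are hypothetical; the abstract only says the ratio is
    variable rather than fixed.
    """
    return rng.uniform(low, high)


def cosine_mask_ratio(step: int, total_steps: int) -> float:
    """Fraction of tokens still masked after `step` refinement steps.

    Assumed cosine schedule: 1.0 at step 0, falling to ~0.0 at the last step.
    """
    progress = step / total_steps
    return math.cos(0.5 * math.pi * progress)


def masked_token_counts(num_tokens: int, total_steps: int) -> List[int]:
    """Number of tokens left masked after each refinement step."""
    counts = []
    for step in range(1, total_steps + 1):
        ratio = cosine_mask_ratio(step, total_steps)
        counts.append(math.floor(ratio * num_tokens))
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    print("training mask ratio:", sample_training_mask_ratio(rng))
    print("masked tokens per step:", masked_token_counts(256, 8))
```

In a sketch like this, each refinement step would predict all masked tokens, keep the most confident predictions, and re-mask the rest so that the count of masked tokens follows the schedule down to zero. The model call itself is omitted because the abstract gives no details of it.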

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:	arXiv:2206.11894 [cs.CV]
	(or arXiv:2206.11894v2 [cs.CV] for this version)

Submission history

From: Agrim Gupta
[v1] Thu, 23 Jun 2022 17:59:33 GMT (13605kb,D)
[v2] Sat, 6 Aug 2022 10:09:47 GMT (13607kb,D)
