Object-Region Video Transformers

Herzig, Roei; Ben-Avraham, Elad; Mangalam, Karttikeya; Bar, Amir; Chechik, Gal; Rohrbach, Anna; Darrell, Trevor; Globerson, Amir

(Submitted on 13 Oct 2021 (v1), last revised 9 Jun 2022 (this version, v3))

Abstract: Video transformers have recently shown great success in video understanding, exceeding CNN performance. Yet existing video transformer models do not explicitly model objects, even though objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them through the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48 and Epic-Kitchen100. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. Code and pretrained models are available on the project page.
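
As a rough illustration of the appearance stream, the sketch below runs single-head self-attention over patch tokens concatenated with object-region tokens, then keeps only the updated patch tokens. It is a simplified reading of the abstract, not the authors' implementation. The function names, the single-frame input, the average pooling used to form object tokens from boxes, and the random projection matrices are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_object_tokens(patches, boxes):
    """Average the patch features inside each box (assumed pooling scheme).

    patches: (H, W, D) grid of patch features for one frame.
    boxes:   list of (r0, c0, r1, c1) in patch-grid units, half-open ranges.
    Returns an (O, D) array with one token per object box.
    """
    d = patches.shape[-1]
    return np.stack(
        [patches[r0:r1, c0:c1].reshape(-1, d).mean(axis=0)
         for r0, c0, r1, c1 in boxes]
    )


def object_region_attention(patches, boxes, w_q, w_k, w_v):
    """Hypothetical single-head attention over patches plus object regions."""
    h, w, d = patches.shape
    patch_tokens = patches.reshape(h * w, d)
    object_tokens = pool_object_tokens(patches, boxes)
    # Patch and object tokens form one joint sequence of length H*W + O.
    tokens = np.concatenate([patch_tokens, object_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    # Keep only the patch positions, now mixed with object information.
    return out[: h * w].reshape(h, w, -1), attn


rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 4, 8))   # toy 4x4 patch grid, 8-dim features
boxes = [(0, 0, 2, 2), (1, 1, 4, 4)]   # two made-up object boxes
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
updated, attn = object_region_attention(patches, boxes, w_q, w_k, w_v)
print(updated.shape, attn.shape)       # (4, 4, 8) (18, 18)
```

Each row of `attn` spans both the 16 patch tokens and the 2 object tokens, which is one way the patch tokens could pick up the contextualized object information the abstract mentions. A video model would apply this across frames as well; the single-frame form here is only for brevity.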

Comments:	CVPR 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2110.06915 [cs.CV]
	(or arXiv:2110.06915v3 [cs.CV] for this version)

Submission history

From: Roei Herzig
[v1] Wed, 13 Oct 2021 17:51:46 GMT (4407kb,D)
[v2] Tue, 30 Nov 2021 15:49:19 GMT (5358kb,D)
[v3] Thu, 9 Jun 2022 20:48:45 GMT (5360kb,D)
