Transformed ROIs for Capturing Visual Transformations in Videos

Rai, Abhinav; Sener, Fadime; Yao, Angela

(Submitted on 6 Jun 2021 (v1), last revised 5 Nov 2022 (this version, v2))

Abstract: Modeling the visual changes that an action brings to a scene is critical for video understanding. Currently, CNNs process one local neighbourhood at a time, so contextual relationships over longer ranges, while still learnable, are captured only indirectly. We present TROI, a plug-and-play module for CNNs that reasons between mid-level feature representations that are otherwise separated in space and time. The module relates localized visual entities, such as hands and interacting objects, and transforms their corresponding regions of interest directly in the feature maps of convolutional layers. With TROI, we achieve state-of-the-art action recognition results on the large-scale datasets Something-Something-V2 and EPIC-Kitchens-100.

Comments:	CVIU 2022 - Computer Vision and Image Understanding
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2106.03162 [cs.CV]
	(or arXiv:2106.03162v2 [cs.CV] for this version)

Submission history

From: Fadime Sener
[v1] Sun, 6 Jun 2021 15:59:53 GMT (7946kb,D)
[v2] Sat, 5 Nov 2022 17:57:37 GMT (8003kb,D)
