Computer Science > Computer Vision and Pattern Recognition
Title: Transformed ROIs for Capturing Visual Transformations in Videos
(Submitted on 6 Jun 2021 (v1), last revised 5 Nov 2022 (this version, v2))
Abstract: Modeling the visual changes that an action brings to a scene is critical for video understanding. Currently, CNNs process one local neighbourhood at a time, so contextual relationships over longer ranges, while still learnable, are indirect. We present TROI, a plug-and-play module for CNNs that reasons between mid-level feature representations that are otherwise separated in space and time. The module relates localized visual entities, such as hands and interacting objects, and transforms their corresponding regions of interest directly in the feature maps of convolutional layers. With TROI, we achieve state-of-the-art action recognition results on the large-scale datasets Something-Something-V2 and EPIC-Kitchens-100.
Submission history
From: Fadime Sener
[v1] Sun, 6 Jun 2021 15:59:53 GMT (7946kb,D)
[v2] Sat, 5 Nov 2022 17:57:37 GMT (8003kb,D)