Title: Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

Abstract: We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2110.03562 [cs.CV]
  (or arXiv:2110.03562v1 [cs.CV] for this version)

Submission history

From: Shuang Li
[v1] Thu, 7 Oct 2021 15:30:18 GMT (28976kb,D)
