Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

Li, Shuang; Du, Yilun; Torralba, Antonio; Sivic, Josef; Russell, Bryan

(Submitted on 7 Oct 2021)

Abstract: We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.

Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2110.03562 [cs.CV] (or arXiv:2110.03562v1 [cs.CV] for this version)

Submission history

From: Shuang Li
[v1] Thu, 7 Oct 2021 15:30:18 GMT (28976kb,D)
