Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Sarkar, Pritam; Etemad, Ali

(Submitted on 9 Nov 2021 (v1), last revised 25 Nov 2022 (this version, v5))

Abstract: We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.

Comments:	Accepted at AAAI 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2111.05329 [cs.CV]
	(or arXiv:2111.05329v5 [cs.CV] for this version)

Submission history

From: Pritam Sarkar
[v1] Tue, 9 Nov 2021 20:24:19 GMT (16265kb,D)
[v2] Sun, 14 Nov 2021 22:48:25 GMT (16265kb,D)
[v3] Thu, 21 Apr 2022 04:37:44 GMT (19488kb,D)
[v4] Tue, 22 Nov 2022 00:14:09 GMT (5359kb,D)
[v5] Fri, 25 Nov 2022 04:41:38 GMT (5340kb,D)
