Title: Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Abstract: We present CrissCross, a self-supervised framework for learning audio-visual representations. In addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations, a novel notion introduced in our framework. We perform in-depth studies showing that, by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
Comments: Accepted at AAAI 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2111.05329 [cs.CV]
  (or arXiv:2111.05329v5 [cs.CV] for this version)

Submission history

From: Pritam Sarkar
[v1] Tue, 9 Nov 2021 20:24:19 GMT (16265kb,D)
[v2] Sun, 14 Nov 2021 22:48:25 GMT (16265kb,D)
[v3] Thu, 21 Apr 2022 04:37:44 GMT (19488kb,D)
[v4] Tue, 22 Nov 2022 00:14:09 GMT (5359kb,D)
[v5] Fri, 25 Nov 2022 04:41:38 GMT (5340kb,D)
