Computer Science > Computer Vision and Pattern Recognition
Title: Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
(Submitted on 9 Nov 2021 (v1), last revised 25 Nov 2022 (this version, v5))
Abstract: We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. We pretrain our proposed solution on three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
Submission history
From: Pritam Sarkar
[v1] Tue, 9 Nov 2021 20:24:19 GMT (16265kb,D)
[v2] Sun, 14 Nov 2021 22:48:25 GMT (16265kb,D)
[v3] Thu, 21 Apr 2022 04:37:44 GMT (19488kb,D)
[v4] Tue, 22 Nov 2022 00:14:09 GMT (5359kb,D)
[v5] Fri, 25 Nov 2022 04:41:38 GMT (5340kb,D)