A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Mo, Shentong; Morgado, Pedro

(Submitted on 30 May 2023)

Abstract: The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source separation using a traditional mix-and-separate framework. Finally, the third objective reinforces visual feature separation and localization by mixing images in pixel space and aligning their representations with those of all corresponding sound sources. Extensive experiments on the MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks (audio-visual source localization, separation, and nearest-neighbor recognition) and empirically demonstrate a strong positive transfer between them.
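The mix-and-separate idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the OneAVM implementation: the abstract does not say how mixtures or targets are built, so the sketch assumes magnitude spectrograms that add element-wise (a common simplification) and ratio-mask training targets. The names `mix`, `ratio_mask`, `apply_mask`, and `l1_loss` are hypothetical and chosen only for illustration.

```python
from typing import List

# Assumption: a spectrogram is a [frequency][time] grid of non-negative magnitudes.
Spectrogram = List[List[float]]


def mix(a: Spectrogram, b: Spectrogram) -> Spectrogram:
    """Build a synthetic mixture by summing two source spectrograms bin by bin."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def ratio_mask(source: Spectrogram, mixture: Spectrogram, eps: float = 1e-8) -> Spectrogram:
    """Training target: the fraction of each mixture bin contributed by `source`."""
    return [[s / (m + eps) for s, m in zip(row_s, row_m)]
            for row_s, row_m in zip(source, mixture)]


def apply_mask(mask: Spectrogram, mixture: Spectrogram) -> Spectrogram:
    """Recover a source estimate by masking the mixture bin by bin."""
    return [[k * m for k, m in zip(row_k, row_m)]
            for row_k, row_m in zip(mask, mixture)]


def l1_loss(pred: Spectrogram, target: Spectrogram) -> float:
    """Mean absolute error between a predicted and a target mask."""
    diffs = [abs(p - t) for row_p, row_t in zip(pred, target)
             for p, t in zip(row_p, row_t)]
    return sum(diffs) / len(diffs)


# Two toy "sources" with 2 frequency bins and 2 time frames each.
source_a = [[1.0, 0.0], [2.0, 4.0]]
source_b = [[3.0, 0.0], [2.0, 0.0]]

mixture = mix(source_a, source_b)
target_a = ratio_mask(source_a, mixture)
estimate_a = apply_mask(target_a, mixture)
```

In a real system, a separation decoder conditioned on visual features would predict the mask, and a loss such as `l1_loss` would compare that prediction with `target_a`. Because the sources are known before mixing, the supervision comes for free, which is what makes the mix-and-separate setup self-supervised.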

Subjects:	Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2305.19458 [cs.SD]
	(or arXiv:2305.19458v1 [cs.SD] for this version)

Submission history

From: Shentong Mo
[v1] Tue, 30 May 2023 23:53:12 GMT (1084kb,D)
