A Hybrid System of Sound Event Detection Transformer and Frame-wise Model for DCASE 2022 Task 4

Li, Yiming; Guo, Zhifang; Ye, Zhirong; Wang, Xiangdong; Liu, Hong; Qian, Yueliang; Tao, Rui; Yan, Long; Ouchi, Kazushige

(Submitted on 18 Oct 2022)

Abstract: In this paper, we describe in detail our system for DCASE 2022 Task 4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model that learns event-level representations and predicts sound event categories and boundaries directly. The latter follows the widely adopted frame-classification scheme, in which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, self-supervised pre-training on unlabeled data is applied, and semi-supervised learning is performed with an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly labeled and unlabeled data. For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. Experimental results show that the hybrid system considerably outperforms either individual model and achieves a PSDS1 of 0.420 and a PSDS2 of 0.783 on the validation set without external data. The code is available at this https URL
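
The two mechanisms named in the abstract, the EMA teacher update and frame-wise post-processing by thresholding and smoothing, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the decay of 0.999, the threshold of 0.5, the median window of 3 frames, and the hop of 0.02 s are all assumptions chosen for illustration, and parameters are modeled as a flat dictionary of floats rather than network tensors.

```python
from typing import Dict, List, Tuple


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float = 0.999) -> None:
    """Update the teacher in place: t <- decay * t + (1 - decay) * s.

    `decay` is an assumed value; the abstract does not state one.
    """
    for name, s_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s_value


def median_smooth(binary: List[int], window: int = 3) -> List[int]:
    """Median-filter a 0/1 sequence with an odd window, zero-padded at both ends."""
    half = window // 2
    padded = [0] * half + list(binary) + [0] * half
    # For an odd window, the middle element of the sorted segment is the median.
    return [sorted(padded[i:i + window])[half] for i in range(len(binary))]


def frames_to_events(probs: List[float], threshold: float = 0.5,
                     window: int = 3,
                     hop_seconds: float = 0.02) -> List[Tuple[float, float]]:
    """Turn per-frame probabilities for one class into (onset, offset) pairs.

    Steps: threshold each frame, smooth the binary sequence with a median
    filter, then merge consecutive active frames into events.
    """
    active = [1 if p >= threshold else 0 for p in probs]
    smoothed = median_smooth(active, window)
    events: List[Tuple[float, float]] = []
    start = None
    # The trailing 0 is a sentinel that closes an event reaching the last frame.
    for i, is_active in enumerate(smoothed + [0]):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    return events
```

In a semi-supervised loop of this kind, `ema_update` would typically be called once after each student optimization step, so the teacher changes slowly and its predictions on weakly labeled and unlabeled clips can serve as pseudo labels. `frames_to_events` would be applied per event class to the frame-wise model's output.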

Comments:	5 pages, 2 figures, accepted for publication in DCASE2022 Workshop
Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2210.09529 [cs.SD]
	(or arXiv:2210.09529v1 [cs.SD] for this version)

Submission history

From: Yiming Li
[v1] Tue, 18 Oct 2022 01:47:05 GMT (120kb,D)
