Computer Science > Sound
Title: Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization
(Submitted on 10 Dec 2019 (this version), latest version 23 Aug 2020 (v2))
Abstract: Sound event detection (SED) is the task of detecting sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled. That is, each audio clip has only audio tags, without the onset and offset times of the sound events. To address the weakly labelled SED problem, we investigate segment-wise training and clip-wise training methods. The proposed systems are based on variants of convolutional neural networks (CNNs), including convolutional recurrent neural networks and our proposed CNN-transformers, for audio tagging and sound event detection. Another challenge of SED is that a system predicts only the presence probabilities of sound events, so thresholds are required to decide whether each sound event is present or absent. Previous work set these thresholds empirically, which is not an optimal solution. To solve this problem, we propose an automatic threshold optimization method. The first stage optimizes the system with respect to metrics that do not depend on the thresholds, such as mean average precision (mAP). The second stage optimizes the thresholds with respect to the metric that depends on those thresholds. The proposed automatic threshold optimization system achieves state-of-the-art audio tagging and SED F1 scores of 0.646 and 0.584, respectively, outperforming the scores of 0.629 and 0.564 obtained with the best manually selected thresholds.
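The second stage described in the abstract can be illustrated with a minimal sketch. It assumes the first stage has already produced presence probabilities on held-out clips, and it swaps in a plain per-class grid search over thresholds to maximize F1; the abstract does not specify the optimizer, so this is not necessarily the authors' procedure. The names `f1_score`, `optimize_thresholds`, `probs`, and `labels` are hypothetical, and the data is a toy example.

```python
from typing import List, Optional, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binarize(scores: Sequence[float], threshold: float) -> List[int]:
    """Turn presence probabilities into 0/1 decisions."""
    return [1 if s >= threshold else 0 for s in scores]


def optimize_thresholds(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    grid: Optional[Sequence[float]] = None,
) -> List[float]:
    """Pick, for each class, the grid threshold with the highest F1.

    Assumption: a simple grid search stands in for whatever optimizer
    the paper actually uses.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 ... 0.99
    n_classes = len(probs[0])
    thresholds = []
    for k in range(n_classes):
        y_true = [row[k] for row in labels]
        scores = [row[k] for row in probs]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            f1 = f1_score(y_true, binarize(scores, t))
            if f1 > best_f1:  # keep the first threshold reaching the best F1
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return thresholds


# Toy held-out data: 4 clips, 2 sound classes (rows are clips).
probs = [[0.9, 0.2], [0.8, 0.4], [0.3, 0.35], [0.1, 0.1]]
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]

thresholds = optimize_thresholds(probs, labels)
for k, t in enumerate(thresholds):
    y_true = [row[k] for row in labels]
    scores = [row[k] for row in probs]
    fixed = f1_score(y_true, binarize(scores, 0.5))
    tuned = f1_score(y_true, binarize(scores, t))
    print(f"class {k}: threshold={t:.2f}  F1 fixed@0.5={fixed:.2f}  F1 tuned={tuned:.2f}")
```

In this toy data the second class never exceeds a probability of 0.5, so a fixed threshold of 0.5 misses every positive clip, while a tuned threshold recovers them. In practice the thresholds would be chosen on a validation set and then applied unchanged to the test set.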
Submission history
From: Qiuqiang Kong
[v1] Tue, 10 Dec 2019 15:25:37 GMT (828kb,D)
[v2] Sun, 23 Aug 2020 10:30:01 GMT (813kb,D)