New submissions for Thu, 6 Oct 22

[1]  arXiv:2210.02287 [pdf]
Title: TC-SKNet with GridMask for Low-complexity Classification of Acoustic scene
Comments: Accepted to APSIPA ASC 2022
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Convolutional neural networks (CNNs) perform well in low-complexity classification tasks such as acoustic scene classification (ASC). However, few studies have examined the relationship between the length of the target speech and the size of the convolution kernels. In this paper, we combine the Selective Kernel Network with temporal convolution (TC-SKNet) to adjust the receptive field of the convolution kernels, addressing the variable length of the target speech while keeping complexity low. GridMask is a data augmentation strategy that masks part of the raw data or feature area. It can improve the generalization of the model, playing a role similar to dropout. In our experiments, the performance gain brought by GridMask is larger than that of spectrum augmentation in ASC. Finally, we adopt AutoML to search for the best structure of TC-SKNet and the best hyperparameters of GridMask to improve classification performance. As a result, TC-SKNet reaches a peak accuracy of 59.87%, which is equivalent to that of the SOTA, while using only 20.9 K parameters.
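
The masking idea described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `grid_mask`, its parameters (`period`, `ratio`, `offset`, `fill`), and the input shape are assumptions chosen for illustration, and the sketch omits details such as rotation or randomized grid size that a full GridMask implementation may include. It only shows how a regular grid of square blocks could be zeroed out in a 2-D feature map such as a log-mel spectrogram.

```python
import numpy as np


def grid_mask(features, period=8, ratio=0.5, offset=(0, 0), fill=0.0):
    """Mask a regular grid of square blocks in a 2-D feature map.

    All names and defaults here are hypothetical, for illustration only.
    period: spacing of the grid, in bins/frames.
    ratio:  fraction of each period covered by the masked square.
    offset: (row, column) shift of the grid, usually drawn at random.
    """
    if features.ndim != 2:
        raise ValueError("expected a 2-D array (e.g. frequency x time)")
    block = int(round(period * ratio))  # side length of each masked square
    rows = (np.arange(features.shape[0]) + offset[0]) % period < block
    cols = (np.arange(features.shape[1]) + offset[1]) % period < block
    mask = np.outer(rows, cols)  # True where a row band and a column band overlap
    out = features.copy()        # leave the input untouched
    out[mask] = fill
    return out


# Example use on a random stand-in for a 40 x 100 spectrogram.
rng = np.random.default_rng(0)
spec = rng.random((40, 100))
offset = tuple(int(v) for v in rng.integers(0, 8, size=2))
augmented = grid_mask(spec, period=8, ratio=0.5, offset=offset)
```

In training, the offset (and possibly the period) would typically be redrawn for every example so that the model sees different masked regions each time.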

[2]  arXiv:2210.02437 [pdf, other]
Title: ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Benchmarking initiatives support the meaningful comparison of competing solutions to prominent problems in speech and language processing. Successive benchmarking evaluations typically reflect a progressive evolution from ideal lab conditions towards those encountered in the wild. ASVspoof, the spoofing and deepfake detection initiative and challenge series, has followed the same trend. This article provides a summary of the ASVspoof 2021 challenge and the results of 37 participating teams. For the logical access task, results indicate that countermeasure solutions are robust to newly introduced encoding and transmission effects. Results for the physical access task indicate the potential to detect replay attacks in real, as opposed to simulated, physical spaces, but a lack of robustness to variations between simulated and real acoustic environments. The DF task, new to the 2021 edition, targets solutions to the detection of manipulated, compressed speech data posted online. While detection solutions offer some resilience to compression effects, they lack generalization across different source datasets. In addition to a summary of the top-performing systems for each task, new analyses of influential data factors and results for hidden data subsets, the article includes a review of post-challenge results, an outline of the principal challenge limitations and a road-map for the future of ASVspoof. Link to the ASVspoof challenge and related resources: https://www.asvspoof.org/index2021.html

Replacements for Thu, 6 Oct 22

[3]  arXiv:2210.00721 (replaced) [pdf, other]
Title: Efficient acoustic feature transformation in mismatched environments using a Guided-GAN
Comments: Final published version available in Speech Communication, 143, pp. 10-20
Journal-ref: Speech Communication, 143, pp.10-20 (2022)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[4]  arXiv:2210.01353 (replaced) [pdf, other]
Title: Pay Self-Attention to Audio-Visual Navigation
Comments: Main paper (10 pages and 7 figures) and appendix (21 figures and 4 tables). Accepted for publication by BMVC 2022. For data and code, see this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[5]  arXiv:2210.01448 (replaced) [pdf, other]
Title: Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings
Comments: SIGGRAPH Asia 2022 (Journal Track); Project Page: this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
[6]  arXiv:2210.01719 (replaced) [pdf, other]
Title: Learning the Spectrogram Temporal Resolution for Audio Classification
Comments: Under review. Code open-sourced at this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[7]  arXiv:2107.09667 (replaced) [pdf, other]
Title: Human Perception of Audio Deepfakes
Comments: Published at ACM Multimedia 2022 Workshop DDAM
Journal-ref: First International Workshop on Deepfake Detection for Audio Multimedia at ACM Multimedia 2022
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[8]  arXiv:2207.06127 (replaced) [pdf, other]
Title: MM-ALT: A Multimodal Automatic Lyric Transcription System
Comments: Accepted by ACM Multimedia 2022. Camera-ready version with some typos corrected
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
