Audio and Speech Processing

New submissions for Wed, 29 Mar 23

[1]  arXiv:2303.15669 [pdf, other]
Title: Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages
Comments: ICASSP 2023
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired transcribed data required to train the model for the target downstream TTS task. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn proper temporal assignment relation between input and output sequences. In addition, we propose a data augmentation method that further improves the data efficiency in fine-tuning. We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios, achieving outstanding performance compared to competing methods. The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewarping

[2]  arXiv:2303.15703 [pdf, other]
Title: AD-YOLO: You Look Only Once in Training Multiple Sound Event Localization and Detection
Comments: 5 pages, 3 figures, accepted for publication in IEEE ICASSP 2023
Subjects: Audio and Speech Processing (eess.AS)

Sound event localization and detection (SELD) combines the identification of sound events with the corresponding directions of arrival (DOA). Recently, event-oriented track output formats have been adopted to solve this problem; however, they still have limited generalization toward real-world problems in an unknown polyphony environment. To address the issue, we propose an angular-distance-based multiple SELD (AD-YOLO), which is an adaptation of the "You Look Only Once" algorithm for SELD. The AD-YOLO format allows the model to learn sound occurrences location-sensitively by assigning class responsibility to DOA predictions. Hence, the format enables the model to handle the polyphony problem, regardless of the number of sound overlaps. We evaluated AD-YOLO on DCASE 2020-2022 challenge Task 3 datasets using four SELD objective metrics. The experimental results show that AD-YOLO achieved outstanding performance overall and was also robust in class-homogeneous polyphony environments.

[3]  arXiv:2303.16021 [pdf, ps, other]
Title: Spatial Active Noise Control Method Based On Sound Field Interpolation From Reference Microphone Signals
Comments: Accepted to International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

A spatial active noise control (ANC) method based on the interpolation of a sound field from reference microphone signals is proposed. In most current spatial ANC methods, a sufficient number of error microphones are required to reduce noise over the target region because the sound field is estimated from error microphone signals. However, in practical applications, it is preferable that the number of error microphones is as small as possible to keep a space in the target region for ANC users. We propose to interpolate the sound field from reference microphones, which are normally placed outside the target region, instead of the error microphones. We derive a fixed filter for spatial noise reduction on the basis of the kernel ridge regression for sound field interpolation. Furthermore, to compensate for estimation errors, we combine the proposed fixed filter with multichannel ANC based on a transition of the control filter using the error microphone signals. Numerical experimental results indicate that regional noise can be sufficiently reduced by the proposed methods even when the number of error microphones is particularly small.
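The interpolation step mentioned in the abstract above can be illustrated with a minimal kernel ridge regression sketch. This is not the authors' implementation: the sinc kernel, the real-valued single-frequency setting, and the names `sinc_kernel`, `solve_linear`, and `krr_interpolate` are all assumptions made for illustration, since the abstract does not state the kernel or the filter design.

```python
import math

def sinc_kernel(p, q, wavenumber):
    """Assumed kernel: sin(k*d) / (k*d), with d the distance between p and q."""
    x = wavenumber * math.dist(p, q)
    return 1.0 if x == 0.0 else math.sin(x) / x

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def krr_interpolate(mic_positions, mic_values, target, wavenumber, reg=1e-3):
    """Estimate the field at `target` from microphone observations.

    Solves (K + reg * I) w = y, then returns sum_i w_i * kernel(target, mic_i).
    """
    n = len(mic_positions)
    gram = [
        [
            sinc_kernel(mic_positions[i], mic_positions[j], wavenumber)
            + (reg if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    weights = solve_linear(gram, mic_values)
    return sum(
        w * sinc_kernel(target, p, wavenumber)
        for w, p in zip(weights, mic_positions)
    )

# Hypothetical reference microphone layout (metres) and observed values.
mics = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]
values = [1.0, 0.5, -0.2]
estimate = krr_interpolate(mics, values, (0.2, 0.2, 0.0), wavenumber=2.0)
```

With a very small regularization weight the estimate at a microphone position reproduces the observed value, which is a quick sanity check on the solver.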

Cross-lists for Wed, 29 Mar 23

[4]  arXiv:2303.15705 (cross-list from cs.CL) [pdf, other]
Title: Translate the Beauty in Songs: Jointly Learning to Align Melody and Translate Lyrics
Comments: 13 pages
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Song translation requires both translation of lyrics and alignment of music notes so that the resulting verse can be sung to the accompanying melody, which is a challenging problem that has attracted some interest in different aspects of the translation process. In this paper, we propose Lyrics-Melody Translation with Adaptive Grouping (LTAG), a holistic solution to automatic song translation by jointly modeling lyrics translation and lyrics-melody alignment. It is a novel encoder-decoder framework that can simultaneously translate the source lyrics and determine the number of aligned notes at each decoding step through an adaptive note grouping module. To address data scarcity, we commissioned a small amount of training data annotated specifically for this task and used large amounts of augmented data through back-translation. Experiments conducted on an English-Chinese song translation data set show the effectiveness of our model in both automatic and human evaluation.

[5]  arXiv:2303.15734 (cross-list from cs.SD) [pdf, other]
Title: Adaptive Background Music for a Fighting Game: A Multi-Instrument Volume Modulation Approach
Comments: This paper under review is made available for participants of DareFightingICE Competition (this https URL) and readers interested in relevant areas
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

This paper presents our work to enhance the background music (BGM) in DareFightingICE by adding an adaptive BGM. The adaptive BGM consists of five different instruments playing a classical music piece called "Air on G-String." The BGM adapts by changing the volume of the instruments. Each instrument is connected to a different element of the game. We then run experiments to evaluate the adaptive BGM by using a deep reinforcement learning AI that only uses audio as input (Blind DL AI). The results show that the performance of the Blind DL AI improves while playing with the adaptive BGM as compared to playing without the adaptive BGM.

[6]  arXiv:2303.15940 (cross-list from cs.SD) [pdf, other]
Title: TransAudio: Towards the Transferable Adversarial Audio Attack via Learning Contextualized Perturbations
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

In a transfer-based attack against Automatic Speech Recognition (ASR) systems, attackers are unable to access the architecture and parameters of the target model. Existing attack methods are mostly investigated in voice assistant scenarios with restricted voice commands, prohibiting their applicability to more general ASR-related applications. To tackle this challenge, we propose a novel contextualized attack with deletion, insertion, and substitution adversarial behaviors, namely TransAudio, which achieves arbitrary word-level attacks based on the proposed two-stage framework. To strengthen the attack transferability, we further introduce an audio score-matching optimization strategy to regularize the training process, which mitigates adversarial example over-fitting to the surrogate model. Extensive experiments and analysis demonstrate the effectiveness of TransAudio against open-source ASR models and commercial APIs.

[7]  arXiv:2303.15944 (cross-list from cs.LG) [pdf, other]
Title: Cluster-Guided Unsupervised Domain Adaptation for Deep Speaker Embedding
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Recent studies have shown that pseudo labels can contribute to unsupervised domain adaptation (UDA) for speaker verification. Inspired by the self-training strategies that use an existing classifier to label the unlabeled data for retraining, we propose a cluster-guided UDA framework that labels the target domain data by clustering and combines the labeled source domain data and pseudo-labeled target domain data to train a speaker embedding network. To improve the cluster quality, we train a speaker embedding network dedicated to clustering by minimizing the contrastive center loss. The goal is to reduce the distance between an embedding and its assigned cluster center while enlarging the distance between the embedding and the other cluster centers. Using VoxCeleb2 as the source domain and CN-Celeb1 as the target domain, we demonstrate that the proposed method can achieve an equal error rate (EER) of 8.10% on the CN-Celeb1 evaluation set without using any labels from the target domain. This result outperforms the supervised baseline by 39.6% and is the state-of-the-art UDA performance on this corpus.
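The contrastive center loss described in the abstract above can be sketched as follows. This is a minimal illustration of one common ratio form of the loss (intra-cluster distance divided by the summed distance to all other centers), not the authors' code; the batch averaging, the `delta` constant, and the function names are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrastive_center_loss(embeddings, assignments, centers, delta=1e-6):
    """Mean over the batch of 0.5 * d(own center) / (sum of d(other centers) + delta).

    The value is small when each embedding is close to its assigned center
    and far from every other center.
    """
    total = 0.0
    for emb, k in zip(embeddings, assignments):
        intra = squared_distance(emb, centers[k])
        inter = sum(
            squared_distance(emb, c) for j, c in enumerate(centers) if j != k
        )
        total += 0.5 * intra / (inter + delta)
    return total / len(embeddings)

# Toy example: two cluster centers and two embeddings (hypothetical values).
centers = [[0.0, 0.0], [4.0, 0.0]]
embeddings = [[1.0, 0.0], [3.0, 0.0]]
good = contrastive_center_loss(embeddings, [0, 1], centers, delta=0.0)
bad = contrastive_center_loss(embeddings, [1, 0], centers, delta=0.0)
```

In the toy example, assigning each embedding to its nearer center gives a much smaller loss than the swapped assignment, which is the behavior the abstract describes.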

[8]  arXiv:2303.16024 (cross-list from cs.CV) [pdf, other]
Title: Egocentric Auditory Attention Localization in Conversations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, or detecting who in a camera wearer's field of view they are listening to. To tackle the new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saal

[9]  arXiv:2303.16031 (cross-list from cs.CR) [pdf, ps, other]
Title: A Universal Identity Backdoor Attack against Speaker Verification based on Siamese Network
Comments: Accepted by Interspeech 2022. The first two authors contributed equally to this work
Subjects: Cryptography and Security (cs.CR); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Speaker verification has been widely used in many authentication scenarios. However, training models for speaker verification requires large amounts of data and computing power, so users often use untrustworthy third-party data or deploy third-party models directly, which may create security risks. In this paper, we propose a backdoor attack for the above scenario. Specifically, for the Siamese network in the speaker verification system, we try to implant a universal identity in the model that can simulate any enrolled speaker and pass the verification. So the attacker does not need to know the victim, which makes the attack more flexible and stealthy. In addition, we design and compare three ways of selecting attacker utterances and two ways of poisoned training for the GE2E loss function in different scenarios. The results on the TIMIT and VoxCeleb1 datasets show that our approach can achieve a high attack success rate while guaranteeing the normal verification accuracy. Our work reveals the vulnerability of the speaker verification system and provides a new perspective to further improve the robustness of the system.

Replacements for Wed, 29 Mar 23

[10]  arXiv:2211.16764 (replaced) [pdf, other]
Title: A General Unfolding Speech Enhancement Method Motivated by Taylor's Theorem
Comments: Submitted to TASLP, revised version, 17 pages
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)