Sound

New submissions

[ total of 12 entries: 1-12 ]

New submissions for Thu, 26 Nov 20

[1]  arXiv:2011.12461 [pdf, other]
Title: SAR-Net: A End-to-End Deep Speech Accent Recognition Network
Comments: 10 pages, 7 figures, journal
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)

This paper proposes an end-to-end deep network for recognizing different accents within the same language. We transfer a deep architecture from the speaker-recognition field to the accent classification task in order to learn utterance-level accent representations. Compared with the individual-level features used in speaker recognition, accent recognition poses the harder problem of acquiring compact group-level features for speakers who share an accent, so a well-discriminated accent feature space is desired. Our framework adopts a multitask-learning mechanism and consists mainly of three modules: a shared front-end encoder based on CNNs and RNNs, a core accent recognition branch, and an auxiliary speech recognition branch; the input is the speech spectrogram. More specifically, given the sequential descriptors learned by the shared encoder, the accent recognition branch first condenses all descriptors into an embedding vector and then explores different discriminative loss functions popular in the face recognition domain to enhance embedding discrimination. Additionally, because accent is a speaking-related timbre, adding the speech recognition branch effectively curbs over-fitting in accent recognition during training. We show that our network, without any data-augmentation preprocessing, is significantly ahead of the baseline system on the accent classification track of the Accented English Speech Recognition Challenge 2020 (AESRC2020), where the state-of-the-art loss function Circle-Loss achieves the best discriminative optimization for accent representation.
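
As a rough illustration of the kind of discriminative loss the abstract refers to, the sketch below computes the Circle loss for one anchor embedding from its similarities to same-accent samples (`sp`) and different-accent samples (`sn`). The function name, the `margin` and `gamma` values, and the use of plain NumPy are assumptions made here for illustration; the abstract does not state how SAR-Net computes similarities or which hyperparameters it uses.

```python
import numpy as np


def _logsumexp(x):
    # Numerically stable log(sum(exp(x))).
    top = np.max(x)
    return top + np.log(np.sum(np.exp(x - top)))


def circle_loss(sp, sn, margin=0.25, gamma=64.0):
    """Circle loss for one anchor.

    sp: similarities to positives (same class), sn: similarities to negatives.
    margin and gamma are illustrative defaults, not values from the paper listed above.
    """
    sp = np.asarray(sp, dtype=float)
    sn = np.asarray(sn, dtype=float)
    # Self-paced weights: pairs far from their optimum get a larger weight.
    alpha_p = np.clip(1.0 + margin - sp, 0.0, None)
    alpha_n = np.clip(sn + margin, 0.0, None)
    logit_p = -gamma * alpha_p * (sp - (1.0 - margin))
    logit_n = gamma * alpha_n * (sn - margin)
    # log(1 + sum_n exp(logit_n) * sum_p exp(logit_p)), written as a softplus.
    return float(np.logaddexp(0.0, _logsumexp(logit_n) + _logsumexp(logit_p)))
```

A well-separated embedding (high `sp`, low `sn`) yields a loss near zero, while an embedding whose negatives are more similar than its positives yields a large loss.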

[2]  arXiv:2011.12536 [pdf, ps, other]
Title: Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding
Authors: Achintya kr. Sarkar, Zheng-Hua Tan (Senior Member, IEEE)
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

In this letter, we propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV), in which a set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision. Next, we explore the bottleneck (BN) feature extracted by training deep neural networks with a self-supervised objective, autoregressive predictive coding (APC), for TD-SV and compare it with the well-studied speaker-discriminant BN feature. The proposed VTL method is then applied to APC and speaker-discriminant BN features. In the end, we combine the VTL perturbation systems trained on MFCC and the two BN features in the score domain. Experiments are performed on the RedDots challenge 2016 database of TD-SV using short utterances with Gaussian mixture model-universal background model and i-vector techniques. Results show the proposed methods significantly outperform the baselines.
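
Score-level fusion across the per-factor systems can be sketched as follows. The abstract does not specify the fusion rule, so this example assumes a weighted mean of the scores (uniform by default) followed by a threshold; `fuse_scores`, `decide`, and the threshold value are hypothetical names and settings chosen for illustration.

```python
import numpy as np


def fuse_scores(scores_per_system, weights=None):
    """Combine trial scores from several systems into one score per trial.

    scores_per_system: array-like of shape (n_systems, n_trials),
    e.g. one row per VTL factor. Uniform weights are an assumption.
    """
    scores = np.asarray(scores_per_system, dtype=float)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    return weights @ scores


def decide(fused_scores, threshold=0.0):
    # Accept a trial when its fused score reaches the threshold.
    return np.asarray(fused_scores) >= threshold
```

For example, two systems scoring two trials as `[[1.0, -2.0], [3.0, 0.0]]` fuse to `[2.0, -1.0]` under uniform weights, so only the first trial is accepted at a threshold of 0.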

[3]  arXiv:2011.12596 [pdf, other]
Title: MTCRNN: A multi-scale RNN for directed audio texture synthesis
Authors: M. Huzaifah, L. Wyse
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

Audio textures are a subset of environmental sounds, often defined as having stable statistical characteristics within an adequately large window of time while possibly being unstructured locally. They include common everyday sounds such as rain, wind, and engines. Given that these complex sounds contain patterns on multiple timescales, they are a challenge to model with traditional methods. We introduce a novel modelling approach for textures, combining recurrent neural networks trained at different levels of abstraction with a conditioning strategy that allows for user-directed synthesis. We demonstrate the model's performance on a variety of datasets, examine its performance on various metrics, and discuss some potential applications.

[4]  arXiv:2011.12754 [pdf, other]
Title: Feature Selection based on Principal Component Analysis for Underwater Source Localization by Deep Learning
Subjects: Sound (cs.SD); Signal Processing (eess.SP); Atmospheric and Oceanic Physics (physics.ao-ph)

In this paper, we propose an interpretable feature selection method based on principal component analysis (PCA) and principal component regression (PCR), which extracts important features for underwater source localization using only the source location, with no other prior information. This feature selection method is combined with a two-step framework for underwater source localization based on a semi-supervised learning scheme. In the framework, the first step uses a convolutional autoencoder to extract latent features from the whole available dataset. The second step performs source localization via an encoder multi-layer perceptron (MLP) trained on a limited labeled portion of the dataset. The proposed approach has been validated on the public dataset SWellEx-96 Event S5. The results show that the framework has appealing accuracy and robustness on unseen data, especially as the amount of training data gradually decreases. After feature selection, the training stage shows a 95% acceleration, and the framework becomes more robust with respect to depth and more accurate when the amount of labeled training data is extremely limited.
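
One way to picture PCA/PCR-based feature selection is the sketch below: regress the target (for example, source range or depth) on the leading principal components, map the coefficients back to the original feature space, and rank features by coefficient magnitude. This is a generic PCR ranking written under assumptions, not the selection criterion of the paper listed above, which the abstract does not describe; `pcr_feature_ranking` and its arguments are hypothetical, and features are assumed to be on comparable scales.

```python
import numpy as np


def pcr_feature_ranking(X, y, n_components):
    """Rank the columns of X by their principal component regression coefficient.

    X: (n_samples, n_features) feature matrix, y: (n_samples,) target.
    Returns (order, beta), where order lists feature indices from most to
    least influential and beta holds the coefficients per original feature.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # PCA via SVD: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T              # (n_features, n_components)
    Z = Xc @ V                           # component scores
    gamma, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    beta = V @ gamma                     # back to the original feature space
    order = np.argsort(-np.abs(beta))
    return order, beta
```

Keeping only the top-ranked columns of `X` then gives a reduced input for the downstream localization model.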

[5]  arXiv:2011.12818 [pdf, other]
Title: Phase retrieval with Bregman divergences: Application to audio signal recovery
Comments: in Proceedings of iTWIST'20, Paper-ID: 16, Nantes, France, December, 2-4, 2020
Subjects: Sound (cs.SD)

Phase retrieval aims to recover a signal from magnitude or power spectra measurements. It is often addressed by considering a minimization problem involving a quadratic cost function. We propose a different formulation based on Bregman divergences, which encompass divergences that are appropriate for audio signal processing applications. We derive a fast gradient algorithm to solve this problem.
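
For reference, the standard definition of a Bregman divergence and three common special cases are sketched below. The problem statement in the last line is an assumed generic form, with measurement operator $A$, nonnegative measurements $r_k$, and exponent $d \in \{1, 2\}$ for magnitude or power spectra; the abstract does not give the exact formulation or the order of the divergence arguments.

```latex
% Bregman divergence generated by a strictly convex, differentiable \psi
D_\psi(y \mid x) = \psi(y) - \psi(x) - \langle \nabla\psi(x),\, y - x \rangle

% Scalar special cases (applied entrywise and summed):
%   \psi(z) = \tfrac{1}{2} z^2      \Rightarrow  D_\psi(y \mid x) = \tfrac{1}{2}(y - x)^2                 (quadratic)
%   \psi(z) = z \log z - z          \Rightarrow  D_\psi(y \mid x) = y \log\tfrac{y}{x} - y + x            (generalized Kullback-Leibler)
%   \psi(z) = -\log z               \Rightarrow  D_\psi(y \mid x) = \tfrac{y}{x} - \log\tfrac{y}{x} - 1   (Itakura-Saito)

% Assumed generic phase retrieval objective:
\min_{x} \; \sum_{k} D_\psi\!\left( r_k \;\middle|\; \lvert (A x)_k \rvert^{d} \right)
```

The quadratic case recovers the usual least-squares cost, while the Kullback-Leibler and Itakura-Saito cases are divergences widely used on audio spectrograms.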

Replacements for Thu, 26 Nov 20

[6]  arXiv:1810.01248 (replaced) [pdf, other]
Title: A Lightweight Music Texture Transfer System
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[7]  arXiv:2010.04301 (replaced) [pdf, other]
Title: Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Comments: Under review as a conference paper at ICLR 2021
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
[8]  arXiv:2010.08123 (replaced) [pdf, other]
Title: Melody Classifier with Stacked-LSTM
Authors: You Li, Zhuowen Lin
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[9]  arXiv:2011.03689 (replaced) [pdf, other]
Title: Detection and Evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems
Comments: 6 pages excluding references. Paper accepted by IEEE Spoken Language Technology (SLT) 2021
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[10]  arXiv:2006.06426 (replaced) [pdf, other]
Title: Deep generative models for musical audio synthesis
Authors: M. Huzaifah, L. Wyse
Comments: This is the authors' own pre-submission version of a chapter for Handbook of Artificial Intelligence for Music: Foundations, Advanced Approaches, and Developments for Creativity, edited by Eduardo R. Miranda, for Springer
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
[11]  arXiv:2010.03360 (replaced) [pdf, other]
Title: Interpreting Imagined Speech Waves with Machine Learning techniques
Subjects: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[12]  arXiv:2011.11715 (replaced) [pdf, other]
Title: Multi-task Language Modeling for Improving Speech Recognition of Rare Words
Comments: Submitted to ICASSP 2021
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)