
Audio and Speech Processing

New submissions

[ total of 13 entries: 1-13 ]

New submissions for Thu, 26 Nov 20

[1]  arXiv:2011.12564 [pdf]
Title: Soft-Median Choice: An Automatic Feature Smoothing Method for Sound Event Detection
Comments: 5 pages, 6 figures, 1 table
Subjects: Audio and Speech Processing (eess.AS)

In existing Sound Event Detection (SED) algorithms, the roughness of the extracted features causes precision and recall to decline. To solve this problem, a novel automatic feature smoothing algorithm based on Soft-Median Choice is proposed. First, in the Convolutional Recurrent Neural Network (CRNN), 1-dimensional (1-D) convolutional layers are added to extract more temporal information. Second, a novel Median Choice module, consisting of median filters and a Linear Choice, is applied in the CRNN to automatically learn from features with different smoothing levels. Third, a Soft-Median function is designed to replace the median function so as to clear the training path and smooth the training process. In the classifier, Linear Softmax is used to avoid the shortcomings of attention. Experiments show that the proposed method achieves higher precision and recall than the compared algorithms.
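As a rough illustration of the smoothing idea in this abstract, the sketch below applies median filters with several window sizes to a frame-level score sequence and mixes the results with fixed weights. The window sizes, the weights, the edge padding, and the function names are assumptions made for illustration; the paper's Soft-Median function and its learned Linear Choice are not reproduced here.

```python
from statistics import median

def median_filter(x, k):
    """Sliding median with an odd window k; edges are padded by repeating end values."""
    assert k % 2 == 1, "window size must be odd"
    h = k // 2
    padded = [x[0]] * h + list(x) + [x[-1]] * h
    return [median(padded[i:i + k]) for i in range(len(x))]

def mix_smoothed(x, windows=(1, 3, 5), weights=(0.2, 0.3, 0.5)):
    """Weighted sum of the sequence smoothed at several levels.

    The fixed weights stand in for a learned mixing layer (an assumption).
    """
    smoothed = [median_filter(x, k) for k in windows]
    return [sum(w * s[t] for w, s in zip(weights, smoothed)) for t in range(len(x))]

# Hypothetical frame-level activations with an isolated spike (index 2)
# and an isolated dip (index 8).
scores = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0]
print(median_filter(scores, 3))  # the spike and the dip are both removed
print(mix_smoothed(scores))
```

A window of 1 leaves the sequence unchanged, so the mix can fall back on the unsmoothed features when smoothing does not help.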

[2]  arXiv:2011.12657 [pdf, other]
Title: Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections
Comments: Submitted to ICASSP 2021
Subjects: Audio and Speech Processing (eess.AS)

In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes. Zero-shot learning in audio classification refers to classification problems that aim to recognize audio instances of sound classes that have no available training data, only semantic side information. We develop factored linear projections by applying rank decomposition to a bilinear model, and use nonlinear activation functions, such as tanh, to model the non-linearity between acoustic embeddings and semantic embeddings. Experimental results show that, compared with the prior bilinear model, the proposed projection methods improve the classification performance of zero-shot learning in audio classification.
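The rank decomposition mentioned in this abstract can be sketched as follows: a bilinear compatibility score a^T W s is replaced by (U a) . (V s), where U and V are low-rank factors, and a tanh can be applied to each projection for a nonlinear variant. The matrices, embeddings, class names, and the exact placement of the tanh are hypothetical choices for illustration, not the paper's configuration.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def factored_score(a, s, U, V, nonlinear=False):
    """Compatibility between acoustic embedding a and semantic embedding s.

    U has shape (r, d_a) and V has shape (r, d_s), so the implied bilinear
    matrix W = U^T V has rank at most r.
    """
    pa, ps = matvec(U, a), matvec(V, s)
    if nonlinear:  # assumed form: tanh applied to both projections
        pa = [math.tanh(x) for x in pa]
        ps = [math.tanh(x) for x in ps]
    return dot(pa, ps)

def classify(a, class_embeddings, U, V, nonlinear=False):
    """Pick the class whose semantic embedding is most compatible with a."""
    return max(class_embeddings,
               key=lambda c: factored_score(a, class_embeddings[c], U, V, nonlinear))

# Toy values (hypothetical): d_a = 3, d_s = 2, rank r = 2.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
classes = {"dog": [1.0, 0.0], "rain": [0.0, 1.0]}
audio = [0.2, 0.9, 0.5]
print(classify(audio, classes, U, V))
print(classify(audio, classes, U, V, nonlinear=True))
```

Because only semantic embeddings are needed for each class, a class with no training audio can still be scored at test time.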

[3]  arXiv:2011.12696 [pdf, other]
Title: Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)

Bootstrapping speech recognition on limited data resources has long been an area of active research. The recent transition to all-neural models and end-to-end (E2E) training brought along particular challenges, as these models are known to be data hungry, but also came with opportunities around language-agnostic representations derived from multilingual data as well as shared word-piece output representations across languages that share script and roots. Here, we investigate the effectiveness of different strategies to bootstrap an RNN Transducer (RNN-T) based automatic speech recognition (ASR) system in the low-resource regime, while exploiting the abundant resources available in other languages as well as the synthetic audio from a text-to-speech (TTS) engine. Experiments show that the combination of a multilingual RNN-T word-piece model, post-ASR text-to-text mapping, and synthetic audio can effectively bootstrap an ASR system for a new language in a scalable fashion with little target language data.

[4]  arXiv:2011.12941 [pdf, other]
Title: Small Footprint Convolutional Recurrent Networks for Streaming Wakeword Detection
Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Audio and Speech Processing (eess.AS)

In this work, we propose small footprint Convolutional Recurrent Neural Network (CRNN) models applied to the problem of wakeword detection and augment them with scaled dot product attention. We find that, within a 250k parameter budget, CRNNs reduce false accepts by 25% relative to Convolutional Neural Network (CNN) models with a 10% reduction in parameter size, and that at a 50k parameter budget they yield up to 32% improvement with a 75% reduction in parameter size compared to word-level Dense Neural Network (DNN) models. We discuss solutions to the challenging problem of performing inference on streaming audio with CRNNs, as well as differences in start-end index errors and latency in comparison to CNN, DNN, and DNN-HMM models.
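For readers unfamiliar with the scaled dot product attention named in this abstract, here is a minimal sketch of the standard operation for a single query: logits are dot products divided by the square root of the key dimension, passed through a softmax, and used to average the values. How the paper wires attention into its CRNN is not stated in the abstract, so the inputs below are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(query, keys, values):
    """Return (context, weights) for one query over a sequence of key/value vectors."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights

# Hypothetical example: three time steps of 2-D recurrent outputs used as keys and values.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = scaled_dot_product_attention([1.0, 1.0], frames, frames)
print(weights)
print(context)
```

The weights sum to 1, so the context vector is a convex combination of the per-frame vectors.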

Cross-lists for Thu, 26 Nov 20

[5]  arXiv:2011.12536 (cross-list from cs.SD) [pdf, ps, other]
Title: Vocal Tract Length Perturbation for Text-Dependent Speaker Verification with Autoregressive Prediction Coding
Authors: Achintya kr. Sarkar, Zheng-Hua Tan (Senior Member, IEEE)
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

In this letter, we propose a vocal tract length (VTL) perturbation method for text-dependent speaker verification (TD-SV), in which a set of TD-SV systems are trained, one for each VTL factor, and score-level fusion is applied to make a final decision. Next, we explore the bottleneck (BN) feature extracted by training deep neural networks with a self-supervised objective, autoregressive predictive coding (APC), for TD-SV and compare it with the well-studied speaker-discriminant BN feature. The proposed VTL method is then applied to APC and speaker-discriminant BN features. In the end, we combine the VTL perturbation systems trained on MFCC and the two BN features in the score domain. Experiments are performed on the RedDots challenge 2016 database of TD-SV using short utterances with Gaussian mixture model-universal background model and i-vector techniques. Results show the proposed methods significantly outperform the baselines.
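Score-level fusion across per-factor systems, as described in this abstract, can be sketched as a weighted combination of the scores each system assigns to the same trial. The equal-weight mean, the VTL factor labels, the score values, and the threshold below are all assumptions for illustration; the abstract does not specify the fusion rule.

```python
def fuse_scores(scores_by_system, weights=None):
    """Combine per-system verification scores into one score.

    scores_by_system maps a system label to its score for one trial.
    With no weights given, an equal-weight mean is used (an assumption).
    """
    names = sorted(scores_by_system)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}
    return sum(weights[n] * scores_by_system[n] for n in names)

def verify(scores_by_system, threshold=0.0):
    """Accept the claimed identity when the fused score reaches the threshold."""
    return fuse_scores(scores_by_system) >= threshold

# Hypothetical log-likelihood-ratio scores from three systems,
# each trained with a different VTL factor.
trial = {"vtl_0.9": 1.2, "vtl_1.0": 0.6, "vtl_1.1": -0.3}
print(fuse_scores(trial))
print(verify(trial))
```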

[6]  arXiv:2011.12596 (cross-list from cs.SD) [pdf, other]
Title: MTCRNN: A multi-scale RNN for directed audio texture synthesis
Authors: M. Huzaifah, L. Wyse
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

Audio textures are a subset of environmental sounds, often defined as having stable statistical characteristics within an adequately large window of time while possibly being unstructured locally. They include common everyday sounds such as rain, wind, and engines. Given that these complex sounds contain patterns on multiple timescales, they are a challenge to model with traditional methods. We introduce a novel modelling approach for textures, combining recurrent neural networks trained at different levels of abstraction with a conditioning strategy that allows for user-directed synthesis. We demonstrate the model's performance on a variety of datasets, examine its performance on various metrics, and discuss some potential applications.

[7]  arXiv:2011.12649 (cross-list from cs.CL) [pdf, other]
Title: Neural Representations for Modeling Variation in English Speech
Comments: Submitted to Journal of Phonetics
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Variation in speech is often represented and investigated using phonetic transcriptions, but transcribing speech is time-consuming and error-prone. To create reliable representations of speech independent of phonetic transcriptions, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and evaluate these differences by comparing them with human native-likeness judgments. We show that Transformer-based speech representations lead to significant performance gains over the use of phonetic transcriptions, and find that feature-based use of Transformer models is most effective with one or more middle layers instead of the final layer. We also demonstrate that these neural speech representations capture not only segmental differences, but also intonational and durational differences that cannot be represented by a set of discrete symbols used in phonetic transcriptions.

Replacements for Thu, 26 Nov 20

[8]  arXiv:2006.06426 (replaced) [pdf, other]
Title: Deep generative models for musical audio synthesis
Authors: M. Huzaifah, L. Wyse
Comments: This is the authors' own pre-submission version of a chapter for Handbook of Artificial Intelligence for Music: Foundations, Advanced Approaches, and Developments for Creativity, edited by Eduardo R. Miranda, for Springer
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
[9]  arXiv:1810.01248 (replaced) [pdf, other]
Title: A Lightweight Music Texture Transfer System
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[10]  arXiv:2010.03360 (replaced) [pdf, other]
Title: Interpreting Imagined Speech Waves with Machine Learning techniques
Subjects: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[11]  arXiv:2010.08123 (replaced) [pdf, other]
Title: Melody Classifier with Stacked-LSTM
Authors: You Li, Zhuowen Lin
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[12]  arXiv:2011.03689 (replaced) [pdf, other]
Title: Detection and Evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems
Comments: 6 pages excluding references. Paper accepted by IEEE Spoken Language Technology (SLT) 2021
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[13]  arXiv:2011.11715 (replaced) [pdf, other]
Title: Multi-task Language Modeling for Improving Speech Recognition of Rare Words
Comments: Submitted to ICASSP 2021
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
