Audio and Speech Processing

New submissions

[ total of 12 entries: 1-12 ]

New submissions for Thu, 9 Jul 20

[1]  arXiv:2007.03759 [pdf, other]
Title: Surveying Off-Board and Extra-Vehicular Monitoring and Progress Towards Pervasive Diagnostics
Comments: 21 pages; 2 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

We survey the state-of-the-art in offboard diagnostics for vehicles, their occupants, and environments, with particular focus on vibroacoustic approaches. We identify promising application areas including data-driven management for shared mobility and automated fleets, usage-based insurance, and vehicle, occupant, and environmental state and condition monitoring. We close by exploring the particular application of vibroacoustic monitoring to vehicle diagnostics and prognostics and propose the introduction of automated vehicle- and context-specific model selection as a means of improving algorithm performance, e.g. to enable smartphone-resident diagnostics. The described approach may serve as the first step in developing "universal diagnostics" utilizing artificial intelligence, with applicability extending beyond the automotive domain.

[2]  arXiv:2007.03900 [pdf, other]
Title: Streaming End-to-End Bilingual ASR Systems with Joint Language Identification
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.

[3]  arXiv:2007.04134 [pdf, other]
Title: Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision
Comments: Accepted at the Workshop on Self-supervision in Audio and Speech at ICML 2020
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information, and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech-related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.

Cross-lists for Thu, 9 Jul 20

[4]  arXiv:2007.03781 (cross-list from cs.SD) [pdf, other]
Title: Acoustic Scene Classification with Spectrogram Processing Strategies
Comments: Submitted to DCASE 2020 Workshop
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance in the acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing strategies. There are two main contributions. The first is an exploration of the impact of combining multiple spectrogram representations at different stages, which provides a meaningful reference for effective spectrogram fusion. The second is a set of processing strategies over multiple frequency bands and multiple temporal frames, proposed to make full use of a single spectrogram representation. The proposed spectrogram processing strategies can be easily transferred to any network structure. The experiments are carried out on the DCASE 2020 Task1 datasets, and the results show that our method achieves accuracies of 81.8% (official baseline: 54.1%) and 92.1% (official baseline: 87.3%) on the officially provided fold 1 evaluation dataset of Task1A and Task1B, respectively.

[5]  arXiv:2007.03893 (cross-list from eess.SP) [pdf, other]
Title: Multi-Resolution Beta-Divergence NMF for Blind Spectral Unmixing
Comments: 13 pages
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)

Blind spectral unmixing is the problem of decomposing the spectrum of a mixed signal or image into a collection of source spectra and their corresponding activations indicating the proportion of each source present in the mixed spectrum. To perform this task, nonnegative matrix factorization (NMF) based on the $\beta$-divergence, referred to as $\beta$-NMF, is a standard and state-of-the-art technique. Many NMF-based methods factorize a data matrix that is the result of a resolution trade-off between two adversarial dimensions. Two instrumental examples are (1) audio spectral unmixing, for which the frequency-by-time data matrix is computed with the short-time Fourier transform and is the result of a trade-off between the frequency resolution and the temporal resolution, and (2) blind hyperspectral unmixing, for which the wavelength-by-location data matrix is a trade-off between the number of wavelengths measured and the spatial resolution. In this paper, we propose a new NMF-based method, dubbed multi-resolution $\beta$-NMF (MR-$\beta$-NMF), to address this issue by fusing the information coming from multiple data with different resolutions in order to produce a factorization with high resolutions for all the dimensions. MR-$\beta$-NMF performs a form of nonnegative joint factorization based on the $\beta$-divergence. In order to solve this problem, we propose multiplicative updates based on a majorization-minimization algorithm. We show in numerical experiments that MR-$\beta$-NMF is able to obtain high resolutions in both dimensions for two applications: the joint factorization of two audio spectrograms, and the hyperspectral and multispectral data fusion problem.

[6]  arXiv:2007.03931 (cross-list from cs.SD) [pdf, other]
Title: Training Sound Event Detection On A Heterogeneous Dataset
Authors: Nicolas Turpault (MULTISPEECH), Romain Serizel (MULTISPEECH)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Training a sound event detection algorithm on a heterogeneous dataset including both recorded and synthetic soundscapes that can have various labeling granularities is a non-trivial task that can lead to systems requiring several technical choices. These technical choices are often passed from one system to another without being questioned. We propose to perform a detailed analysis of the DCASE 2020 task 4 sound event detection baseline with regard to several aspects such as the type of data used for training, the parameters of the mean-teacher, or the transformations applied while generating the synthetic soundscapes. Some of the parameters that are usually used as default are shown to be sub-optimal.

[7]  arXiv:2007.03932 (cross-list from cs.SD) [pdf, other]
Title: Improving Sound Event Detection In Domestic Environments Using Sound Separation
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now, these problems were mainly tackled at the classifier level. We propose to use sound separation as a pre-processing step for sound event detection. In this paper, we start from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline. We explore different methods to combine separated sound sources and the original mixture within the sound event detection. Furthermore, we investigate the impact of adapting the sound separation model to the sound event detection data on both the sound separation and the sound event detection.

Replacements for Thu, 9 Jul 20

[8]  arXiv:2001.02480 (replaced) [pdf, other]
Title: Audio Inpainting: Revisited and Reweighted
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

[9]  arXiv:2006.15406 (replaced) [pdf, other]
Title: Listen carefully and tell: an audio captioning system based on residual learning and gammatone audio representation
Comments: Submitted to DCASE2020 Workshop, Workshop on Detection and Classification of Acoustic Scenes and Events
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

[10]  arXiv:2007.03001 (replaced) [pdf, other]
Title: Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

[11]  arXiv:2006.08386 (replaced) [pdf, other]
Title: COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations
Comments: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML), 2020, Vienna, Austria
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)

[12]  arXiv:2007.02126 (replaced) [pdf, other]
Title: Deep Graph Random Process for Relational-Thinking-Based Speech Recognition
Comments: Accepted at ICML 2020
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)