New submissions

[ total of 8 entries: 1-8 ]

New submissions for Fri, 3 Feb 23

[1]  arXiv:2302.00868 [pdf, other]
Title: Speech Enhancement for Virtual Meetings on Cellular Networks
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

We study speech enhancement using deep learning (DL) for virtual meetings on cellular devices, where transmitted speech suffers from background noise and transmission loss that degrade speech quality. Since the Deep Noise Suppression (DNS) Challenge dataset does not contain such practical disturbances, we collect a transmitted DNS (t-DNS) dataset using Zoom Meetings over the T-Mobile network. We select two baseline models: Demucs and FullSubNet. Demucs is an end-to-end model that takes time-domain inputs and outputs time-domain denoised speech, whereas FullSubNet takes time-frequency-domain inputs and outputs the energy ratio of the target speech in the inputs. The goal of this project is to enhance speech transmitted over cellular networks using deep learning models.

[2]  arXiv:2302.01090 [pdf, other]
Title: Goniometers are a Powerful Acoustic Feature for Music Information Retrieval Tasks
Authors: Tim Ziemer
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)

Goniometers, also known as Phase Scopes or Vector Scopes, are audio metering tools that help music producers and mixing engineers monitor spatial aspects of a music mix, such as the stereo panorama, the width of single sources, the amount and diffuseness of reverberation, as well as phase cancellations that may occur at the sweet spot and in a mono mixdown. In addition, they implicitly convey information about the dynamics of the sound. Self-organizing maps trained with a goniometer are consulted to explore the usefulness of this acoustic feature for music information retrieval tasks. One can see that goniometers are able to classify different genres and cluster a single album. The advantage of goniometers is causality: music producers and mixing engineers consciously consult goniometers to reach their desired sound, which is not the case for other acoustic features, from Zero-Crossing Rate to Mel-Frequency Cepstral Coefficients.
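
As a rough illustration of the display described above: a goniometer is commonly drawn by rotating each stereo sample pair (left, right) by 45 degrees, so that a mono signal traces a vertical line and a fully out-of-phase signal traces a horizontal one. The sketch below assumes this common mid/side convention (the sign of the horizontal axis varies between tools) and is not taken from the paper; the function name `goniometer_points` is hypothetical.

```python
import math


def goniometer_points(left, right):
    """Map stereo sample pairs to goniometer (x, y) coordinates.

    Assumed convention (not from the paper):
      x = (L - R) / sqrt(2)   # "side" component, horizontal axis
      y = (L + R) / sqrt(2)   # "mid" component, vertical axis
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same length")
    scale = 1.0 / math.sqrt(2.0)
    return [((l - r) * scale, (l + r) * scale) for l, r in zip(left, right)]


# A mono signal (identical channels) lies on the vertical axis: x is 0.
mono = goniometer_points([0.5, -0.25, 1.0], [0.5, -0.25, 1.0])

# A fully out-of-phase signal lies on the horizontal axis: y is 0.
anti = goniometer_points([0.5, -0.25, 1.0], [-0.5, 0.25, -1.0])
```

The resulting point cloud is what such a meter visualizes; how the paper turns it into a feature for self-organizing maps is not specified in the abstract.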

Cross-lists for Fri, 3 Feb 23

[3]  arXiv:2302.00765 (cross-list from cs.CL) [pdf, other]
Title: Visually Grounded Keyword Detection and Localisation for Low-Resource Languages
Comments: PhD dissertation, University of Stellenbosch, 108 pages, submitted and accepted 2023
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

This study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech. The study focusses on two main research questions: (1) Is keyword localisation possible with VGS models? (2) Can keyword localisation be done cross-lingually in a real low-resource setting? Four methods for localisation are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%. A new dataset containing spoken captions in the Yoruba language is also collected and released for cross-lingual keyword localisation. The cross-lingual model obtains a precision of 16% in actual keyword localisation, and this performance can be improved by initialising from a model pretrained on English data. The study presents a detailed analysis of the model's success and failure modes and highlights the challenges of using VGS models for keyword localisation in low-resource settings.

[4]  arXiv:2302.00836 (cross-list from cs.CL) [pdf, other]
Title: Improving Rare Words Recognition through Homophone Extension and Unified Writing for Low-resource Cantonese Speech Recognition
Comments: The 13th International Symposium on Chinese Spoken Language Processing (ISCSLP 2022)
Journal-ref: Published in ISCSLP 2022
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Homophone characters are common in tonal syllable-based languages, such as Mandarin and Cantonese. Data-intensive end-to-end Automatic Speech Recognition (ASR) systems are more likely to misrecognize homophone characters and rare words under low-resource settings. For the problem of low-resource Cantonese speech recognition, this paper presents a novel homophone extension method to integrate human knowledge of the homophone lexicon into the beam search decoding process with language model re-scoring. In addition, we propose an automatic unified writing method to merge the variants of Cantonese characters and standardize speech annotation guidelines, which enables more efficient utilization of labeled utterances by providing more samples for the merged characters. We empirically show that both homophone extension and unified writing improve the recognition performance significantly on both in-domain and out-of-domain test sets, with an absolute Character Error Rate (CER) decrease of around 5% and 18%.
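
For readers unfamiliar with the metric quoted above: CER is conventionally the character-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference characters. An "absolute" decrease is the plain difference between two CER values. The following is a minimal sketch under that standard definition; the function names and the example strings are made up for illustration and are not data from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "abcde"                       # made-up reference transcript
baseline_cer = cer(reference, "axxde")    # two substitutions out of five
improved_cer = cer(reference, "abxde")    # one substitution out of five
absolute_decrease = baseline_cer - improved_cer
relative_decrease = absolute_decrease / baseline_cer
```

In this toy case the absolute decrease is 20 percentage points while the relative decrease is 50%, which is why the abstract specifies "absolute".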

Replacements for Fri, 3 Feb 23

[5]  arXiv:2205.08459 (replaced) [pdf, other]
Title: Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. The current version includes 36 pages, 8 figures, and 3 tables
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

[6]  arXiv:2212.05301 (replaced) [pdf, other]
Title: Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Comments: Accepted by AAAI2023
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

[7]  arXiv:2302.00286 (replaced) [pdf, other]
Title: Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training
Comments: arXiv admin note: text overlap with arXiv:2206.10805
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

[8]  arXiv:2301.10047 (replaced) [pdf, other]
Title: DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
Comments: 13 pages, 3 figures
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)