
Audio and Speech Processing

New submissions

[ total of 14 entries: 1-14 ]

New submissions for Fri, 30 Jul 21

[1]  arXiv:2107.13616 [pdf, other]
Title: Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers
Subjects: Audio and Speech Processing (eess.AS); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD)

There are many important applications for detecting and localizing specific sound events within long, untrimmed documents including keyword spotting, medical observation, and bioacoustic monitoring for conservation. Deep learning techniques often set the state-of-the-art for these tasks. However, for some types of events, there is insufficient labeled data to train deep learning models. In this paper, we propose novel approaches to few-shot sound event detection utilizing region proposals and the Perceiver architecture, which is capable of accurately localizing sound events with very few examples of each class of interest. Motivated by a lack of suitable benchmark datasets for few-shot audio event detection, we generate and evaluate on two novel episodic rare sound event datasets: one using clips of celebrity speech as the sound event, and the other using environmental sounds. Our highest performing proposed few-shot approaches achieve 0.575 and 0.672 F1-score, respectively, with 5-shot 5-way tasks on these two datasets. These represent absolute improvements of 0.200 and 0.234 over strong proposal-free few-shot sound event detection baselines.

[2]  arXiv:2107.13634 [pdf, other]
Title: Neural Remixer: Learning to Remix Music with Interactive Control
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Remixing, the task of manipulating the level and/or effects of individual instruments to recompose a mixture of recordings, is common across a variety of applications such as music production, audio-visual post-production, podcasts, and more. This process, however, traditionally requires access to individual source recordings, restricting the creative process. To work around this, source separation algorithms can separate a mixture into its respective components. Then, a user can adjust their levels and mix them back together. This two-step approach, however, still suffers from audible artifacts and motivates further work. In this work, we seek to learn to remix music directly. To do this, we propose two neural remixing architectures that extend Conv-TasNet to remix either via a) source estimates directly or b) their latent representations. Both methods leverage a remixing data augmentation scheme as well as a mixture reconstruction loss to achieve an end-to-end separation and remixing process. We evaluate our methods using the Slakh and MUSDB datasets and report both source separation performance and remixing quality. Our results suggest that learning-to-remix significantly outperforms a strong separation baseline, is particularly useful for small changes, and can provide interactive user controls.
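
The two-step baseline described in this abstract (separate the mixture, rescale each source, sum again) can be illustrated with a minimal sketch. It assumes the sources are already available as equal-length lists of samples; the function name `remix`, the signal values, and the gains are hypothetical and not taken from the paper, and the sketch does not represent the proposed neural architectures.

```python
def remix(sources, gains):
    """Scale each source by its gain and sum the results sample by sample."""
    if len(sources) != len(gains):
        raise ValueError("need exactly one gain per source")
    n = len(sources[0])
    if any(len(s) != n for s in sources):
        raise ValueError("all sources must have the same length")
    return [sum(g * s[t] for s, g in zip(sources, gains)) for t in range(n)]


# Hypothetical toy signals standing in for separated source estimates.
vocals = [0.5, -0.5, 0.25]
drums = [0.25, 0.25, -0.25]

# Unit gains reproduce the original mixture.
original = remix([vocals, drums], [1.0, 1.0])

# Doubling the drum gain gives a remix with louder drums.
louder_drums = remix([vocals, drums], [1.0, 2.0])
```

With real separation output, any error in the source estimates is scaled along with the source, which is one way to see why the abstract reports audible artifacts for this approach.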

Cross-lists for Fri, 30 Jul 21

[3]  arXiv:2107.13591 (cross-list from physics.med-ph) [pdf, other]
Title: Detection of squawks in respiratory sounds of mechanically ventilated COVID-19 patients
Comments: 5 pages, 6 figures
Subjects: Medical Physics (physics.med-ph); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Mechanically ventilated patients typically exhibit abnormal respiratory sounds. Squawks are short inspiratory adventitious sounds that may occur in patients with pneumonia, such as COVID-19 patients. In this work we devised a method for squawk detection in mechanically ventilated patients by developing algorithms for respiratory cycle estimation, squawk candidate identification, feature extraction, and clustering. The best classifier reached an F1 of 0.48 at the sound file level and an F1 of 0.66 at the recording session level. These preliminary results are promising, as they were obtained in noisy environments. This method will give health professionals a new feature to assess the potential deterioration of critically ill patients.

[4]  arXiv:2107.13617 (cross-list from cs.SD) [pdf, other]
Title: Pitch-Informed Instrument Assignment Using a Deep Convolutional Network with Multiple Kernel Shapes
Comments: 4 figures, 4 tables and 7 pages. Accepted for publication at ISMIR Conference 2021
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)

This paper proposes a deep convolutional neural network for performing note-level instrument assignment. Given a polyphonic multi-instrumental music signal along with its ground truth or predicted notes, the objective is to assign an instrumental source for each note. This problem is addressed as a pitch-informed classification task where each note is analysed individually. We also propose to utilise several kernel shapes in the convolutional layers in order to facilitate learning of efficient timbre-discriminative feature maps. Experiments on the MusicNet dataset using 7 instrument classes show that our approach is able to achieve an average F-score of 0.904 when the original multi-pitch annotations are used as the pitch information for the system, and that it also excels if the note information is provided using third-party multi-pitch estimation algorithms. We also include ablation studies investigating the effects of the use of multiple kernel shapes and comparing different input representations for the audio and the note-related information.

[5]  arXiv:2107.13832 (cross-list from cs.SD) [pdf, other]
Title: Blind Room Parameter Estimation Using Multiple-Multichannel Speech Recordings
Comments: Accepted at WASPAA 2021 (IEEE Workshop on Applications of Signal Processing to Audio and Acoustics)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Knowing the geometrical and acoustical parameters of a room may benefit applications such as audio augmented reality, speech dereverberation or audio forensics. In this paper, we study the problem of jointly estimating the total surface area, the volume, as well as the frequency-dependent reverberation time and mean surface absorption of a room in a blind fashion, based on two-channel noisy speech recordings from multiple, unknown source-receiver positions. A novel convolutional neural network architecture leveraging both single- and inter-channel cues is proposed and trained on a large, realistic simulated dataset. Results on both simulated and real data show that using multiple observations in one room significantly reduces estimation errors and variances on all target quantities, and that using two channels helps the estimation of surface and volume. The proposed model outperforms a recently proposed blind volume estimation method on the considered datasets.

[6]  arXiv:2107.13969 (cross-list from cs.CY) [pdf, other]
Title: Significance of Speaker Embeddings and Temporal Context for Depression Detection
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Depression detection from speech has attracted a lot of attention in recent years. However, the significance of speaker-specific information in depression detection has not yet been explored. In this work, we analyze the significance of speaker embeddings for the task of depression detection from speech. Experimental results show that the speaker embeddings provide important cues to achieve state-of-the-art performance in depression detection. We also show that combining conventional OpenSMILE and COVAREP features, which carry complementary information, with speaker embeddings further improves the depression detection performance. The significance of temporal context in the training of deep learning models for depression detection is also analyzed in this paper.

[7]  arXiv:2107.14009 (cross-list from cs.SD) [pdf, other]
Title: PKSpell: Data-Driven Pitch Spelling and Key Signature Estimation
Comments: International Society for Music Information Retrieval Conference (ISMIR), Nov 2021, Online, India
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

We present PKSpell: a data-driven approach for the joint estimation of pitch spelling and key signatures from MIDI files. Both elements are fundamental for the production of a full-fledged musical score and facilitate many MIR tasks such as harmonic analysis, section identification, melodic similarity, and search in a digital music library. We design a deep recurrent neural network model that only requires information readily available in all kinds of MIDI files, including performances, or other symbolic encodings. We release a model trained on the ASAP dataset. Our system can be used with these pre-trained parameters and is easy to integrate into a MIR pipeline. We also propose a data augmentation procedure that helps retraining on small datasets. PKSpell achieves strong key signature estimation performance on a challenging dataset. Most importantly, this model establishes a new state-of-the-art performance on the MuseData pitch spelling dataset without retraining.

[8]  arXiv:2107.14028 (cross-list from cs.SD) [pdf, other]
Title: Estimating Respiratory Rate From Breath Audio Obtained Through Wearable Microphones
Comments: International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2021
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Respiratory rate (RR) is a clinical metric used to assess overall health and physical fitness. An individual's RR can change from their baseline due to chronic illness symptoms (e.g., asthma, congestive heart failure), acute illness (e.g., breathlessness due to infection), and over the course of the day due to physical exhaustion during heightened exertion. Remote estimation of RR can offer a cost-effective method to track disease progression and cardio-respiratory fitness over time. This work investigates a model-driven approach to estimate RR from short audio segments obtained after physical exertion in healthy adults. Data were collected from 21 individuals using microphone-enabled, near-field headphones before, during, and after strenuous exercise. RR was manually annotated by counting perceived inhalations and exhalations. A multi-task Long Short-Term Memory (LSTM) network with convolutional layers was implemented to process mel-filterbank energies, estimate RR in varying background noise conditions, and predict heavy breathing, indicated by an RR of more than 25 breaths per minute. The multi-task model performs both classification and regression tasks and leverages a mixture of loss functions. It was observed that RR can be estimated with a concordance correlation coefficient (CCC) of 0.76 and a mean squared error (MSE) of 0.2, demonstrating that audio can be a viable signal for approximating RR.
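
The two metrics quoted in this abstract have standard closed forms: MSE is the mean of squared differences, and CCC is 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), using population variance and covariance. The sketch below computes both in plain Python. The function names and sample values are hypothetical and not taken from the paper; only the 25 breaths-per-minute threshold comes from the abstract.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between annotations and estimates."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient (population moments).

    Undefined (division by zero) when both inputs are constant and equal.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


HEAVY_BREATHING_RR = 25  # breaths per minute, threshold stated in the abstract


def is_heavy_breathing(rr):
    """Classification target: RR strictly above the threshold."""
    return rr > HEAVY_BREATHING_RR


# Hypothetical annotated and estimated RR values (breaths per minute).
annotated = [12.0, 18.0, 30.0]
estimated = [13.0, 19.0, 31.0]

ccc = concordance_cc(annotated, estimated)
mse = mean_squared_error(annotated, estimated)
```

Unlike Pearson correlation, CCC penalizes a constant offset between estimates and annotations through the squared mean difference in the denominator, so a perfectly correlated but biased estimator scores below 1.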

[9]  arXiv:2107.14132 (cross-list from cs.SD) [pdf, other]
Title: Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection
Comments: Submitted to ASVspoof 2021 Workshop
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

In this paper, we provide a series of multi-tasking benchmarks for simultaneously detecting spoofing at the segmental and utterance levels in the PartialSpoof database. First, we propose the SELCNN network, which inserts squeeze-and-excitation (SE) blocks into a light convolutional neural network (LCNN) to enhance the capacity of hidden feature selection. Then, we implement multi-task learning (MTL) frameworks with SELCNN followed by bidirectional long short-term memory (Bi-LSTM) as the basic model. We discuss MTL in PartialSpoof step by step in terms of architecture (uni-branch/multi-branch) and training strategies (from-scratch/warm-up). Experiments show that the multi-task model performs better than single-task models. Also, within MTL, the binary-branch architecture utilizes information from the two levels more adequately than a uni-branch model. For the binary-branch architecture, fine-tuning a warm-up model works better than training from scratch. Overall, under the binary-branch multi-task architecture, models can handle segment-level and utterance-level predictions simultaneously. Furthermore, the multi-task model trained by fine-tuning a segmental warm-up model performs relatively better at both levels, except on the evaluation set for segmental detection. Segmental detection should be explored further.
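
The squeeze-and-excitation (SE) block mentioned in this abstract reweights feature channels in three steps: average each channel (squeeze), pass the channel averages through a small bottleneck network ending in a sigmoid (excitation), and multiply each channel by its resulting scale. The plain-Python sketch below shows only this generic mechanism. It is a simplified assumption, not the SELCNN from the paper: biases are omitted, each channel is a flat list of values, and the weights `w1` and `w2` are hypothetical rather than learned.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply a bias-free squeeze-and-excitation step.

    feature_map: list of channels, each a list of activations.
    w1: bottleneck weights, shape (hidden, channels).
    w2: expansion weights, shape (channels, hidden).
    """
    # Squeeze: global average of each channel.
    z = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation, part 1: bottleneck layer with ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, part 2: expansion layer with sigmoid, one scale per channel.
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every activation in a channel by that channel's scale.
    return [[s * v for v in ch] for s, ch in zip(scales, feature_map)]


# Hypothetical two-channel feature map with a one-unit bottleneck.
features = [[1.0, 3.0], [2.0, 2.0]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]

reweighted = se_block(features, w1, w2)
```

In this toy setting the first channel receives a scale above 0.5 and the second a scale below 0.5, which is the sense in which the block selects among hidden features.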

Replacements for Fri, 30 Jul 21

[10]  arXiv:2105.02469 (replaced) [pdf, other]
Title: Point Cloud Audio Processing
Comments: Accepted at WASPAA 2021, Code: this https URL
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[11]  arXiv:2012.05680 (replaced) [pdf, other]
Title: Direct multimodal few-shot learning of speech and images
Comments: Accepted to Interspeech 2021
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[12]  arXiv:2105.07596 (replaced) [pdf, other]
Title: Sound Event Detection with Adaptive Frequency Selection
Comments: Accepted by IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2021
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[13]  arXiv:2107.04954 (replaced) [pdf, other]
Title: ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data
Comments: Accepted in ACMMM 21. Camera ready version
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[14]  arXiv:2107.08661 (replaced) [pdf, other]
Title: Translatotron 2: Robust direct speech-to-speech translation
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
