Audio and Speech Processing

Authors and titles for eess.AS in Sep 2023

[1]  arXiv:2309.00169 [pdf, other]
Title: RepCodec: A Speech Representation Codec for Speech Tokenization
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[2]  arXiv:2309.00223 [pdf, other]
Title: The FruitShell French synthesis system at the Blizzard 2023 Challenge
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[3]  arXiv:2309.00376 [pdf, other]
Title: Remixing-based Unsupervised Source Separation from Scratch
Comments: Interspeech2023, 5pages, 2figures, 2tables
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[4]  arXiv:2309.00424 [pdf, other]
Title: Learning Speech Representation From Contrastive Token-Acoustic Pretraining
Comments: Accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[5]  arXiv:2309.00647 [pdf, other]
Title: Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data
Comments: Interspeech 2023
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[6]  arXiv:2309.01108 [pdf, other]
Title: Acoustic-to-articulatory inversion for dysarthric speech: Are pre-trained self-supervised representations favorable?
Comments: Accepted to IEEE ICASSP Workshops 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[7]  arXiv:2309.01142 [pdf, other]
Title: MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
Comments: This work was submitted on April 10, 2022 and accepted on August 29, 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[8]  arXiv:2309.01164 [pdf, other]
Title: Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[9]  arXiv:2309.01513 [pdf, other]
Title: RGI-Net: 3D Room Geometry Inference from Room Impulse Responses in the Absence of First-order Echoes
Comments: 5 pages, 3 figures, 3 tables
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[10]  arXiv:2309.01535 [pdf, other]
Title: Single-Channel Speech Enhancement with Deep Complex U-Networks and Probabilistic Latent Space Models
Journal-ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[11]  arXiv:2309.02265 [pdf, other]
Title: PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[12]  arXiv:2309.02285 [pdf, other]
Title: PromptTTS 2: Describing and Generating Voices with Text Prompt
Comments: Demo page: this https URL
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[13]  arXiv:2309.02393 [pdf, other]
Title: In-Ear-Voice: Towards Milli-Watt Audio Enhancement With Bone-Conduction Microphones for In-Ear Sensing Platforms
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
[14]  arXiv:2309.02418 [pdf, other]
Title: Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
Comments: Accepted by INTERSPEECH 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[15]  arXiv:2309.02432 [pdf, other]
Title: Employing Real Training Data for Deep Noise Suppression
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[16]  arXiv:2309.02466 [pdf, ps, other]
Title: Minimal Effective Theory for Phonotactic Memory: Capturing Local Correlations due to Errors in Speech
Comments: 16 pages; 7 figs
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[17]  arXiv:2309.02539 [pdf, other]
Title: A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation
Comments: Accepted to the IEEE Open Journal of Signal Processing (ICASSP 2024 Track)
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
[18]  arXiv:2309.02567 [pdf, other]
Title: Symbolic Music Representations for Classification Tasks: A Systematic Evaluation
Comments: To be published in the Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy
Journal-ref: Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
[19]  arXiv:2309.02592 [pdf, other]
Title: BWSNet: Automatic Perceptual Assessment of Audio Signals
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[20]  arXiv:2309.02730 [pdf, other]
Title: Stylebook: Content-Dependent Speaking Style Modeling for Any-to-Any Voice Conversion using Only Speech Data
Comments: 5 pages, 2 figures, 2 tables
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[21]  arXiv:2309.02743 [pdf, other]
Title: MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023
Comments: 6 pages
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[22]  arXiv:2309.03019 [pdf, other]
Title: Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation
Authors: Danwei Cai, Ming Li
Subjects: Audio and Speech Processing (eess.AS)
[23]  arXiv:2309.03149 [pdf, other]
Title: Real-time auralization for performers on virtual stages
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[24]  arXiv:2309.03199 [pdf, other]
Title: Matcha-TTS: A fast TTS architecture with conditional flow matching
Comments: 5 pages, 3 figures. Final version, accepted to IEEE ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
[25]  arXiv:2309.03337 [pdf, other]
Title: Leveraging Geometrical Acoustic Simulations of Spatial Room Impulse Responses for Improved Sound Event Detection and Localization
Comments: 5 pages, 3 figures, 3 tables, presented in the Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[26]  arXiv:2309.03486 [pdf, other]
Title: Simulating room transfer functions between transducers mounted on audio devices using a modified image source method
Comments: The following article has been submitted to the Journal of the Acoustical Society of America (JASA). After it is published, it will be found at this http URL
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[27]  arXiv:2309.03684 [pdf, other]
Title: Causal Signal-Based DCCRN with Overlapped-Frame Prediction for Online Speech Enhancement
Journal-ref: Proc. INTERSPEECH 2023, 4039-4043 (2023)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[28]  arXiv:2309.04265 [pdf, other]
Title: Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification
Comments: 5 pages, 2 figures, accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS)
[29]  arXiv:2309.04516 [pdf, ps, other]
Title: End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[30]  arXiv:2309.04628 [pdf, other]
Title: Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[31]  arXiv:2309.05027 [pdf, other]
Title: VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
Comments: 4 figure, 5 pages, accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD)
[32]  arXiv:2309.05057 [pdf, other]
Title: Gray Jedi MVDR Post-filtering
Comments: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[33]  arXiv:2309.05248 [pdf, other]
Title: Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
Comments: 4 pages 1 reference page, ICASSP format
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[34]  arXiv:2309.05384 [pdf, other]
Title: Towards generalisable and calibrated synthetic speech detection with self-supervised representations
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[35]  arXiv:2309.05423 [pdf, other]
Title: Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[36]  arXiv:2309.05455 [pdf, other]
Title: Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
Subjects: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
[37]  arXiv:2309.05777 [pdf, ps, other]
Title: Smartwatch-derived Acoustic Markers for Deficits in Cognitively Relevant Everyday Functioning
Journal-ref: 2023 IEEE International Conference on Digital Health (ICDH)
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[38]  arXiv:2309.06014 [pdf, other]
Title: Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?
Comments: To appear in ICASSP 2024. code on github: this https URL
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[39]  arXiv:2309.06096 [pdf, other]
Title: iPhonMatchNet: Zero-Shot User-Defined Keyword Spotting Using Implicit Acoustic Echo Cancellation
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[40]  arXiv:2309.06183 [pdf, other]
Title: Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments
Comments: Accepted to IEEE/ACM TASLP
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[41]  arXiv:2309.06531 [pdf, other]
Title: ASPED: An Audio Dataset for Detecting Pedestrians
Comments: 4+1 pages, ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[42]  arXiv:2309.06572 [pdf, other]
Title: Addressing the Blind Spots in Spoken Language Processing
Authors: Amit Moryossef
Comments: 5 pages
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[43]  arXiv:2309.06661 [pdf, ps, other]
Title: Sound field decomposition based on two-stage neural networks
Comments: 31 pages, 16 figures
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
[44]  arXiv:2309.06934 [pdf, other]
Title: VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[45]  arXiv:2309.06946 [pdf, ps, other]
Title: Reorganization of the auditory-perceptual space across the human vocal range
Journal-ref: Proceedings of the 20th International Congress of Phonetic Sciences (2023) 560-564
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Populations and Evolution (q-bio.PE)
[46]  arXiv:2309.07043 [pdf, other]
Title: A Flexible Online Framework for Projection-Based STFT Phase Retrieval
Comments: Submitted to ICASSP 24
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[47]  arXiv:2309.07081 [pdf, other]
Title: Can Whisper perform speech-based in-context learning?
Comments: Accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[48]  arXiv:2309.07164 [pdf, other]
Title: Hybrid ASR for Resource-Constrained Robots: HMM - Deep Learning Fusion
Comments: To be published in IEEE Access, 9 pages, 14 figures, Received valuable support from CCBD PESU, for associated code, see this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[49]  arXiv:2309.07287 [pdf, other]
Title: Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis
Comments: Accepted to Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[50]  arXiv:2309.07369 [pdf, other]
Title: Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[51]  arXiv:2309.07372 [pdf, other]
Title: Training Audio Captioning Models without Audio
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[52]  arXiv:2309.07377 [pdf, other]
Title: Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Comments: Accepted in ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[53]  arXiv:2309.07385 [pdf, other]
Title: Multi-dimensional Speech Quality Assessment in Crowdsourcing
Comments: arXiv admin note: substantial text overlap with arXiv:2303.06566
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[54]  arXiv:2309.07414 [pdf, other]
Title: PromptASR for contextualized ASR with controllable style
Comments: Proc. ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[55]  arXiv:2309.07466 [pdf, other]
Title: Codec Data Augmentation for Time-domain Heart Sound Classification
Comments: Accepted by ICAICTA 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[56]  arXiv:2309.07498 [pdf, other]
Title: Hierarchical Metadata Information Constrained Self-Supervised Learning for Anomalous Sound Detection Under Domain Shift
Comments: To appear at ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[57]  arXiv:2309.07586 [pdf, other]
Title: Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion
Comments: Accepted in Interspeech 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[58]  arXiv:2309.07592 [pdf, other]
Title: StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings
Comments: Accepted in 12th Speech Synthesis Workshop (SSW), Satellite event in Interspeech 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[59]  arXiv:2309.07648 [pdf, other]
Title: Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer
Comments: Accepted in INTERSPEECH 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[60]  arXiv:2309.07757 [pdf, other]
Title: Complexity Scaling for Speech Denoising
Comments: Submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[61]  arXiv:2309.07803 [pdf, other]
Title: SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias
Comments: Accepted by ICME 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[62]  arXiv:2309.07828 [pdf, other]
Title: EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[63]  arXiv:2309.07925 [pdf, other]
Title: Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023
Comments: 5 pages, 4 figures
Journal-ref: The 31st ACM International Conference on Multimedia (MM'23), 2023
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
[64]  arXiv:2309.07927 [pdf, ps, other]
Title: Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[65]  arXiv:2309.07937 [pdf, other]
Title: Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[66]  arXiv:2309.08005 [pdf, ps, other]
Title: Efficient Face Detection with Audio-Based Region Proposals for Human-Robot Interactions
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Image and Video Processing (eess.IV)
[67]  arXiv:2309.08007 [pdf, ps, other]
Title: DiariST: Streaming Speech Translation with Speaker Diarization
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[68]  arXiv:2309.08023 [pdf, other]
Title: USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models
Comments: 5 pages, 2 figures, 4 tables
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[69]  arXiv:2309.08030 [pdf, other]
Title: AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement
Comments: extended version for the accepted paper at ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[70]  arXiv:2309.08060 [pdf, other]
Title: DDSP-SFX: Acoustically-guided sound effects generation with differentiable digital signal processing
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[71]  arXiv:2309.08105 [pdf, other]
Title: Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[72]  arXiv:2309.08131 [pdf, other]
Title: t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability
Comments: 5 pages, 2 figures, submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[73]  arXiv:2309.08140 [pdf, other]
Title: PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[74]  arXiv:2309.08141 [pdf, other]
Title: Audio Difference Learning for Audio Captioning
Comments: submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
[75]  arXiv:2309.08153 [pdf, other]
Title: Fine-tune the pretrained ATST model for sound event detection
Comments: 5 pages, 3 figures, camera-ready version for ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[76]  arXiv:2309.08157 [pdf, other]
Title: RVAE-EM: Generative speech dereverberation based on recurrent variational auto-encoder and convolutive transfer function
Comments: Submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[77]  arXiv:2309.08255 [pdf, other]
Title: Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech
Comments: Accepted at ICONIP 2023
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[78]  arXiv:2309.08263 [pdf, other]
Title: Improving Voice Conversion for Dissimilar Speakers Using Perceptual Losses
Comments: Accepted in The German Annual Conference on Acoustics 2023 (DAGA)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[79]  arXiv:2309.08279 [pdf, other]
Title: Improving Short Utterance Anti-Spoofing with AASIST2
Comments: 5 pages, 2 figures, accepted by ICASSP
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[80]  arXiv:2309.08285 [pdf, other]
Title: One-Class Knowledge Distillation for Spoofing Speech Detection
Comments: submitted to icassp 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[81]  arXiv:2309.08290 [pdf, other]
Title: Head-Related Transfer Function Interpolation with a Spherical CNN
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[82]  arXiv:2309.08294 [pdf, other]
Title: Speech-dependent Modeling of Own Voice Transfer Characteristics for In-ear Microphones in Hearables
Comments: Presented at Forum Acusticum 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[83]  arXiv:2309.08295 [pdf, other]
Title: A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism
Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[84]  arXiv:2309.08320 [pdf, other]
Title: Diff-SV: A Unified Hierarchical Framework for Noise-Robust Speaker Verification Using Score-Based Diffusion Probabilistic Models
Comments: 5 pages, 2 figures, accepted for ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[85]  arXiv:2309.08348 [pdf, other]
Title: The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction
Comments: 5 pages, 4 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[86]  arXiv:2309.08355 [pdf, other]
Title: Semi-supervised Sound Event Detection with Local and Global Consistency Regularization
Comments: submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[87]  arXiv:2309.08357 [pdf, other]
Title: Audio-free Prompt Tuning for Language-Audio Models
Comments: submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[88]  arXiv:2309.08377 [pdf, other]
Title: DiaCorrect: Error Correction Back-end For Speaker Diarization
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[89]  arXiv:2309.08436 [pdf, other]
Title: Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition
Comments: Accepted at ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Machine Learning (stat.ML)
[90]  arXiv:2309.08454 [pdf, other]
Title: Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[91]  arXiv:2309.08489 [pdf, other]
Title: Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
[92]  arXiv:2309.08561 [pdf, other]
Title: Open-vocabulary Keyword-spotting with Adaptive Instance Normalization
Comments: Under Review
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[93]  arXiv:2309.08684 [pdf, other]
Title: Music Source Separation Based on a Lightweight Deep Learning Framework (DTTNET: DUAL-PATH TFC-TDF UNET)
Comments: Accepted for ICASSP 2024. Additional experiments can be found in the published version on IEEE Xplore
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[94]  arXiv:2309.08730 [pdf, other]
Title: MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
Journal-ref: 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
[95]  arXiv:2309.08804 [pdf, other]
Title: Stack-and-Delay: a new codebook pattern for music generation
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[96]  arXiv:2309.08828 [pdf, other]
Title: Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[97]  arXiv:2309.08876 [pdf, ps, other]
Title: Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[98]  arXiv:2309.09028 [pdf, other]
Title: Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions
Comments: Paper in submission
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[99]  arXiv:2309.09180 [pdf, other]
Title: Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture
Comments: Accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[100]  arXiv:2309.09220 [pdf, other]
Title: Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[101]  arXiv:2309.09262 [pdf, other]
Title: PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Comments: Accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[102]  arXiv:2309.09270 [src]
Title: Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning
Comments: We found that the results were obtained from incorrect experimental settings. New experiments are needed
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[103]  arXiv:2309.09443 [pdf, other]
Title: Enhancing Multilingual Speech Recognition through Language Prompt Tuning and Frame-Level Language Adapter
Comments: Submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[104]  arXiv:2309.09493 [pdf, other]
Title: HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[105]  arXiv:2309.09510 [pdf, ps, other]
Title: Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
Comments: To appear in the proceedings of ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[106]  arXiv:2309.09546 [pdf, other]
Title: Training dynamic models using early exits for automatic speech recognition on resource-constrained devices
Comments: Accepted at the ICASSP Workshop Self-supervision in Audio, Speech and Beyond 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[107]  arXiv:2309.09548 [pdf, other]
Title: Utilizing Whisper to Enhance Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[108]  arXiv:2309.09630 [pdf, other]
Title: Refining DNN-based Mask Estimation using CGMM-based EM Algorithm for Multi-channel Noise Reduction
Journal-ref: Proc. Interspeech 2022, 2923-2927 (2022)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[109]  arXiv:2309.09677 [pdf, other]
Title: Single and Few-step Diffusion for Generative Speech Enhancement
Comments: copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
[110]  arXiv:2309.09836 [pdf, other]
Title: RECAP: Retrieval-Augmented Audio Captioning
Comments: ICASSP 2024. Code and data: this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
[111]  arXiv:2309.09920 [pdf, other]
Title: Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
[112]  arXiv:2309.09950 [pdf, other]
Title: Investigating End-to-End ASR Architectures for Long Form Audio Transcription
Comments: PrePrint. Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[113]  arXiv:2309.09996 [pdf, other]
Title: Improving Speech Recognition for African American English With Audio Classification
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[114]  arXiv:2309.10089 [pdf, other]
Title: HTEC: Human Transcription Error Correction
Comments: 13 pages, 4 figures, 11 tables, AMLC 2023
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
[115]  arXiv:2309.10299 [pdf, other]
Title: Using fine-tuning and min lookahead beam search to improve Whisper
Comments: 8 pages, submitted to IEEE ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[116]  arXiv:2309.10455 [pdf, other]
Title: Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:2305.14933
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[117]  arXiv:2309.10524 [pdf, other]
Title: Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition
Comments: Submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[118]  arXiv:2309.10537 [pdf, other]
Title: FoleyGen: Visually-Guided Audio Generation
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM); Sound (cs.SD)
[119]  arXiv:2309.10605 [pdf, other]
Title: An Active Noise Control System Based on Soundfield Interpolation Using a Physics-informed Neural Network
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[120]  arXiv:2309.10707 [pdf, other]
Title: Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[121]  arXiv:2309.10787 [pdf, other]
Title: AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Comments: Accepted to ICASSP 2024; Evaluation Code: this https URL Submission Platform: this https URL
Subjects: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
[122]  arXiv:2309.10795 [pdf, other]
Title: Exploring Speech Enhancement for Low-resource Speech Synthesis
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS)
[123]  arXiv:2309.10917 [pdf, other]
Title: End-to-End Speech Recognition Contextualization with Large Language Models
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[124]  arXiv:2309.10922 [pdf, other]
Title: Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition
Comments: Preprint. Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[125]  arXiv:2309.11014 [pdf, ps, other]
Title: Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech
Comments: 4 pages, 6 tables, accepted in APSIPA-ASC 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[126]  arXiv:2309.11059 [pdf, other]
Title: Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[127]  arXiv:2309.11210 [pdf, other]
Title: Speak While You Think: Streaming Speech Synthesis During Text Generation
Comments: Under review for ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[128]  arXiv:2309.11243 [pdf, other]
Title: Joint Minimum Processing Beamforming and Near-end Listening Enhancement
Comments: Accepted at IEEE ICASSP 2024 Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[129]  arXiv:2309.11327 [pdf, other]
Title: Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition
Comments: 6 pages, submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[130]  arXiv:2309.11487 [pdf, other]
Title: A Neural TTS System with Parallel Prosody Transfer from Unseen Speakers
Comments: Presented at Interspeech 2023
Journal-ref: Proc. INTERSPEECH 2023, 4853-4857 (2023)
Subjects: Audio and Speech Processing (eess.AS)
[131]  arXiv:2309.11730 [pdf, other]
Title: Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition
Comments: submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[132]  arXiv:2309.11756 [pdf, other]
Title: Sparsely Shared LoRA on Whisper for Child Speech Recognition
Comments: Accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[133]  arXiv:2309.11768 [pdf, other]
Title: CoMFLP: Correlation Measure based Fast Search on ASR Layer Pruning
Comments: Accepted by Interspeech 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[134]  arXiv:2309.11827 [pdf, other]
Title: The Impact of Silence on Speech Anti-Spoofing
Comments: 16 pages, 9 figures, 13 tables
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[135]  arXiv:2309.11922 [pdf, other]
Title: Cluster-based pruning techniques for audio data
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[136]  arXiv:2309.11976 [pdf, other]
Title: Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and a Teacher Model
Comments: Accepted at ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[137]  arXiv:2309.12065 [pdf, other]
Title: Is the Ideal Ratio Mask Really the Best? -- Exploring the Best Extraction Performance and Optimal Mask of Mask-based Beamformers
Authors: Atsuo Hiroe (1), Katsutoshi Itoyama (1 and 2), Kazuhiro Nakadai (2) ((1) Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan, (2) Honda Research Institute Japan Co., Ltd., Saitama, Japan)
Comments: Accepted in APSIPA 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[138]  arXiv:2309.12121 [pdf, other]
Title: A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement
Comments: 13 pages, 9 figures
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[139]  arXiv:2309.12553 [pdf, other]
Title: ICASSP 2023 Acoustic Echo Cancellation Challenge
Comments: arXiv admin note: substantial text overlap with arXiv:2202.13290, arXiv:2009.04972
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[140]  arXiv:2309.12581 [pdf, other]
Title: Sampling-Frequency-Independent Universal Sound Separation
Comments: Submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[141]  arXiv:2309.12608 [pdf, other]
Title: SPGM: Prioritizing Local Features for enhanced speech separation performance
Comments: This paper was accepted by ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[142]  arXiv:2309.12656 [pdf, other]
Title: NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization
Comments: 5 pages, 5 figures, Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[143]  arXiv:2309.12712 [pdf, other]
Title: Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[144]  arXiv:2309.12714 [pdf, other]
Title: Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[145]  arXiv:2309.12763 [pdf, other]
Title: Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models
Comments: 5 pages, 4 figures, ICASSP24
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[146]  arXiv:2309.12766 [pdf, other]
Title: A Study on Incorporating Whisper for Robust Speech Assessment
Comments: Accepted to IEEE ICME 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[147]  arXiv:2309.12792 [pdf, other]
Title: DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[148]  arXiv:2309.12914 [pdf, other]
Title: VIC-KD: Variance-Invariance-Covariance Knowledge Distillation to Make Keyword Spotting More Robust Against Adversarial Attacks
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[149]  arXiv:2309.12963 [pdf, ps, other]
Title: Massive End-to-end Models for Short Search Queries
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[150]  arXiv:2309.13018 [pdf, other]
Title: Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[151]  arXiv:2309.13029 [pdf, other]
Title: Memory-augmented conformer for improved end-to-end long-form ASR
Journal-ref: Proc. INTERSPEECH 2023, 2218-2222
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
[152]  arXiv:2309.13102 [pdf, other]
Title: Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR
Comments: In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2023
Subjects: Audio and Speech Processing (eess.AS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Sound (cs.SD)
[153]  arXiv:2309.13253 [pdf, other]
Title: Contrastive Speaker Embedding With Sequential Disentanglement
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[154]  arXiv:2309.13504 [pdf, other]
Title: Attention Is All You Need For Blind Room Volume Estimation
Comments: 5 pages, 4 figures, to be published in proceedings of ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[155]  arXiv:2309.13537 [pdf, other]
Title: Speech enhancement with frequency domain auto-regressive modeling
Comments: 10 pages
Journal-ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 2023
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
[156]  arXiv:2309.13605 [pdf, other]
Title: Efficient Black-Box Speaker Verification Model Adaptation with Reprogramming and Backend Learning
Authors: Jingyu Li, Tan Lee
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[157]  arXiv:2309.13650 [pdf, ps, other]
Title: Cross-modal Alignment with Optimal Transport for CTC-based ASR
Comments: Accepted to IEEE ASRU 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[158]  arXiv:2309.13664 [pdf, other]
Title: VoiceLDM: Text-to-Speech with Environmental Context
Comments: Demos and code are available at this https URL
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[159]  arXiv:2309.13819 [pdf, other]
Title: A Two-Step Approach for Narrowband Source Localization in Reverberant Rooms
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[160]  arXiv:2309.13874 [pdf, other]
Title: Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[161]  arXiv:2309.13905 [pdf, other]
Title: AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[162]  arXiv:2309.13916 [pdf, other]
Title: Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[163]  arXiv:2309.13938 [pdf, ps, other]
Title: Evaluating Classification Systems Against Soft Labels with Fuzzy Precision and Recall
Comments: published in DCASE 2023
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[164]  arXiv:2309.13963 [pdf, other]
Title: Connecting Speech Encoder and Large Language Model for ASR
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[165]  arXiv:2309.13994 [pdf, other]
Title: Unsupervised Accent Adaptation Through Masked Language Model Correction Of Discrete Self-Supervised Speech Units
Comments: Submitted to ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[166]  arXiv:2309.14080 [pdf, other]
Title: Analysis and Detection of Pathological Voice using Glottal Source Features
Comments: Copyright 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Journal-ref: IEEE Journal of Selected Topics in Signal Processing, Vol. 14, No. 2, pp. 367-379, February 2020
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
[167]  arXiv:2309.14089 [pdf, other]
Title: BiSinger: Bilingual Singing Voice Synthesis
Comments: Accepted by ASRU2023
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[168]  arXiv:2309.14107 [pdf, other]
Title: Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech
Comments: Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Journal-ref: in Proc. ICASSP, Rhodes Island, Greece, June 4-10, 2023
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
[169]  arXiv:2309.14109 [pdf, other]
Title: Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification
Comments: accepted by ASRU 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[170]  arXiv:2309.14129 [pdf, other]
Title: Speaker anonymization using neural audio codec language models
Comments: Accepted at ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[171]  arXiv:2309.14324 [pdf, other]
Title: Towards General-Purpose Text-Instruction-Guided Voice Conversion
Comments: Accepted to ASRU 2023
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
[172]  arXiv:2309.14460 [pdf, other]
Title: Online Active Learning For Sound Event Detection
Comments: Submitted to ICASSP 2024. Publication will belong to IEEE
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Signal Processing (eess.SP)
[173]  arXiv:2309.14462 [pdf, ps, other]
Title: On the Impact of Quantization and Pruning of Self-Supervised Speech Models for Downstream Speech Recognition Tasks "In-the-Wild"
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[174]  arXiv:2309.14507 [pdf, other]
Title: Noise-Robust DSP-Assisted Neural Pitch Estimation with Very Low Complexity
Comments: Submitted to ICASSP 2024, 5 pages
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[175]  arXiv:2309.14521 [pdf, other]
Title: NoLACE: Improving Low-Complexity Speech Codec Enhancement Through Adaptive Temporal Shaping
Comments: final version, accepted at ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[176]  arXiv:2309.14741 [pdf, other]
Title: Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[177]  arXiv:2309.14758 [pdf, other]
Title: Exploring RWKV for Memory Efficient and Low Latency Streaming ASR
Comments: submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[178]  arXiv:2309.14761 [pdf, other]
Title: Optimization Techniques for a Physical Model of Human Vocalisation
Comments: Accepted to DAFx 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[179]  arXiv:2309.14922 [pdf, other]
Title: Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference
Comments: Accepted at ASRU 2023
Journal-ref: IEEE Automatic Speech Recognition and Understanding Workshop 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[180]  arXiv:2309.15064 [pdf, other]
Title: Simultaneously Learning Speaker's Direction and Head Orientation from Binaural Recordings
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[181]  arXiv:2309.15224 [pdf, other]
Title: Collaborative Watermarking for Adversarial Speech Synthesis
Authors: Lauri Juvela (Aalto University, Finland), Xin Wang (National Institute of Informatics, Japan)
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
[182]  arXiv:2309.15496 [pdf, other]
Title: DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion
Comments: Accepted by ICASSP2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[183]  arXiv:2309.15643 [pdf, other]
Title: Why do Angular Margin Losses work well for Semi-Supervised Anomalous Sound Detection?
Journal-ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32 (2024), p. 608-622
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[184]  arXiv:2309.15717 [pdf, other]
Title: Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription
Comments: Accepted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[185]  arXiv:2309.15796 [pdf, other]
Title: Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
[186]  arXiv:2309.15938 [pdf, other]
Title: Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[187]  arXiv:2309.16036 [pdf, other]
Title: Multichannel Voice Trigger Detection Based on Transform-average-concatenate
Comments: Accepted at HSCMA 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[188]  arXiv:2309.16048 [pdf, other]
Title: Advancing Acoustic Howling Suppression through Recursive Training of Neural Networks
Comments: Paper in submission
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[189]  arXiv:2309.16049 [pdf, other]
Title: Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression
Comments: Paper in submission
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[190]  arXiv:2309.16060 [pdf, other]
Title: Does Single-channel Speech Enhancement Improve Keyword Spotting Accuracy? A Case Study
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[191]  arXiv:2309.16093 [pdf, ps, other]
Title: Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR
Comments: Submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[192]  arXiv:2309.16247 [pdf, other]
Title: PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[193]  arXiv:2309.16482 [pdf, ps, other]
Title: Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
Comments: Accepted at HSCMA Satellite Workshop at ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[194]  arXiv:2309.16867 [pdf, other]
Title: Towards High Resolution Weather Monitoring with Sound Data
Comments: 5 pages, submitted to ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[195]  arXiv:2309.16953 [pdf, other]
Title: Enhancing Code-switching Speech Recognition with Interactive Language Biases
Comments: Submitted to IEEE ICASSP 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[196]  arXiv:2309.16954 [pdf, other]
Title: Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features
Comments: 5 pages, 3 figures, 4 tables
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[197]  arXiv:2309.17020 [pdf, other]
Title: Low-Resource Self-Supervised Learning with SSL-Enhanced TTS
Comments: ASRU 2023 SPARKS Workshop
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[198]  arXiv:2309.17267 [pdf, other]
Title: Wiki-En-ASR-Adapt: Large-scale synthetic dataset for English ASR Customization
Comments: Accepted to IEEE ASRU 2023
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
[199]  arXiv:2309.17298 [pdf, other]
Title: LRPD: Large Replay Parallel Dataset
Journal-ref: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6612-6616
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[200]  arXiv:2309.17384 [pdf, other]
Title: Toward Universal Speech Enhancement for Diverse Input Conditions
Comments: 6 pages, 3 figures, 5 tables, published in ASRU 2023 (corrected the results of noisy speech on CHiME-4 (Simu) in Table 4)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
[201]  arXiv:2309.02961 (cross-list from eess.SP) [pdf, other]
Title: LuViRA Dataset Validation and Discussion: Comparing Vision, Radio, and Audio Sensors for Indoor Localization
Comments: 10 pages, 11 figures
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[202]  arXiv:2309.04670 (cross-list from eess.SP) [pdf, ps, other]
Title: Generalized Minimum Error with Fiducial Points Criterion for Robust Learning
Comments: 12 pages, 9 figures
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
[203]  arXiv:2309.07147 (cross-list from eess.SP) [pdf, other]
Title: DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection
Subjects: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[204]  arXiv:2309.09645 (cross-list from eess.SP) [pdf, ps, other]
Title: Scaling the time and Fourier domains to align periodically and their convolution
Subjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[205]  arXiv:2309.15136 (cross-list from eess.SP) [pdf, other]
Title: A multi-modal approach for identifying schizophrenia using cross-modal attention
Comments: Accepted to Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2024
Subjects: Signal Processing (eess.SP); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
[206]  arXiv:2309.00126 (cross-list from cs.SD) [pdf, other]
Title: QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[207]  arXiv:2309.00140 (cross-list from cs.SD) [pdf, other]
Title: Improving vision-inspired keyword spotting using dynamic module skipping in streaming conformer encoder
Journal-ref: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[208]  arXiv:2309.00284 (cross-list from cs.SD) [pdf, other]
Title: Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[209]  arXiv:2309.00329 (cross-list from cs.SD) [pdf, other]
Title: Mi-Go: Test Framework which uses YouTube as Data Source for Evaluating Speech Recognition Models like OpenAI's Whisper
Comments: 25 pages, 9 tables, 3 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Software Engineering (cs.SE); Audio and Speech Processing (eess.AS)
[210]  arXiv:2309.00347 (cross-list from cs.IR) [pdf, ps, other]
Title: Towards Contrastive Learning in Music Video Domain
Comments: 6 pages, 2 figures, 2 tables
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[211]  arXiv:2309.00454 (cross-list from cs.SD) [pdf, other]
Title: CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[212]  arXiv:2309.00723 (cross-list from cs.CL) [pdf, other]
Title: Contextual Biasing of Named-Entities with Large Language Models
Comments: 5 pages, 4 figures. Conference: ICASSP 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[213]  arXiv:2309.00878 (cross-list from cs.SD) [pdf, other]
Title: Pretraining Representations for Bioacoustic Few-shot Detection using Supervised Contrastive Learning
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[214]  arXiv:2309.00883 (cross-list from cs.SD) [pdf, other]
Title: DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin
Comments: accepted by TASLP
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[215]  arXiv:2309.00916 (cross-list from cs.CL) [pdf, other]
Title: BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[216]  arXiv:2309.00929 (cross-list from cs.SD) [pdf, other]
Title: Timbre-reserved Adversarial Attack in Speaker Identification
Comments: 11 pages, 8 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[217]  arXiv:2309.01076 (cross-list from cs.LG) [pdf, other]
Title: Federated Few-shot Learning for Cough Classification with Edge Devices
Comments: 21 pages, 5 figures
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[218]  arXiv:2309.01202 (cross-list from cs.GR) [pdf, other]
Title: MAGMA: Music Aligned Generative Motion Autodecoder
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[219]  arXiv:2309.01212 (cross-list from cs.SD) [pdf, other]
Title: NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[220]  arXiv:2309.01340 (cross-list from cs.SD) [pdf, other]
Title: MDSC: Towards Evaluating the Style Consistency Between Music and Dance
Comments: 19 pages, 19 figures
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[221]  arXiv:2309.01437 (cross-list from cs.SD) [pdf, other]
Title: SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge
Comments: Proceedings of Interspeech
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[222]  arXiv:2309.01480 (cross-list from cs.SD) [pdf, other]
Title: BadSQA: Stealthy Backdoor Attacks Using Presence Events as Triggers in Non-Intrusive Speech Quality Assessment
Comments: 5 pages, 6 figures, conference
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[223]  arXiv:2309.01576 (cross-list from cs.CL) [pdf, other]
Title: A Comparative Analysis of Pretrained Language Models for Text-to-Speech
Comments: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop (SSW) in Grenoble, France, from 26th to 28th August 2023
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[224]  arXiv:2309.01947 (cross-list from cs.CL) [pdf, other]
Title: TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models
Comments: Meta AI; Submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[225]  arXiv:2309.01950 (cross-list from cs.CV) [pdf, other]
Title: RADIO: Reference-Agnostic Dubbing Video Synthesis
Comments: Accepted by WACV 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[226]  arXiv:2309.02106 (cross-list from cs.CL) [pdf, other]
Title: Leveraging Label Information for Multimodal Emotion Recognition
Comments: Accepted by Interspeech 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[227]  arXiv:2309.02133 (cross-list from cs.SD) [pdf, other]
Title: Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion
Comments: Accepted to the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Demo page: this https URL Code: this https URL
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[228]  arXiv:2309.02145 (cross-list from cs.CL) [pdf, other]
Title: Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition
Comments: Submitted and accepted for ICANN 2023 (32nd International Conference on Artificial Neural Networks)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[229]  arXiv:2309.02232 (cross-list from cs.SD) [pdf, other]
Title: FSD: An Initial Chinese Dataset for Fake Song Detection
Comments: Submitted to ICASSP 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[230]  arXiv:2309.02243 (cross-list from cs.SD) [pdf, other]
Title: Self-Similarity-Based and Novelty-based loss for music structure analysis
Authors: Geoffroy Peeters
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[231]  arXiv:2309.02399 (cross-list from cs.SD) [pdf, other]
Title: The Batik-plays-Mozart Corpus: Linking Performance to Score to Musicological Annotations
Comments: To be published in the Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy
Subjects: Sound (cs.SD); Digital Libraries (cs.DL); Audio and Speech Processing (eess.AS)
[232]  arXiv:2309.02404 (cross-list from cs.SD) [pdf, other]
Title: Voice Morphing: Two Identities in One Voice
Comments: Accepted oral paper at BIOSIG 2023
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[233]  arXiv:2309.02405 (cross-list from cs.CV) [pdf, other]
Title: Generating Realistic Images from In-the-wild Sounds
Comments: Accepted to ICCV 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[234]  arXiv:2309.02459 (cross-list from cs.SD) [pdf, other]
Title: Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
Comments: Proceedings of Interspeech. arXiv admin note: text overlap with arXiv:2309.01437
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[235]  arXiv:2309.02612 (cross-list from cs.SD) [pdf, other]
Title: Music Source Separation with Band-Split RoPE Transformer
Comments: This paper explains the SAMI-ByteDance MSS system submitted to Sound Demixing Challenge (SDX23) Music Separation Track. Version 2 of paper fixed some typos
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[236]  arXiv:2309.02767 (cross-list from cs.SD) [pdf, ps, other]
Title: Simultaneous Measurement of Multiple Acoustic Attributes Using Structured Periodic Test Signals Including Music and Other Sound Materials
Comments: 8 pages, 17 figures, accepted for APSIPA ASC 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[237]  arXiv:2309.02780 (cross-list from cs.CL) [pdf, other]
Title: GRASS: Unified Generation Model for Speech-to-Semantic Tasks
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[238]  arXiv:2309.02796 (cross-list from cs.SD) [pdf, other]
Title: Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals
Authors: Yiming Wu
Comments: Accepted to DAFx 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[239]  arXiv:2309.02836 (cross-list from cs.SD) [pdf, other]
Title: BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
Comments: Accepted at ICASSP 2024. Equation (5) in the previous version is wrong. We modified it
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[240]  arXiv:2309.03036 (cross-list from cs.SD) [pdf, other]
Title: An Efficient Temporary Deepfake Location Approach Based Embeddings for Partially Spoofed Audio Detection
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[241]  arXiv:2309.03104 (cross-list from quant-ph) [pdf, other]
Title: Quid Manumit -- Freeing the Qubit for Art
Authors: Mark Carney
Comments: 8 pages, 6 figures, to appear at ISQCMC in Berlin, Oct 5-6th 2023
Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[242]  arXiv:2309.03238 (cross-list from cs.LG) [pdf, other]
Title: Implicit Design Choices and Their Impact on Emotion Recognition Model Development and Evaluation
Authors: Mimansa Jaiswal
Comments: PhD Thesis
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[243]  arXiv:2309.03298 (cross-list from cs.SD) [pdf, ps, other]
Title: Presenting the SWTC: A Symbolic Corpus of Themes from John Williams' Star Wars Episodes I-IX
Comments: Corpus report (5000 words)
Subjects: Sound (cs.SD); Symbolic Computation (cs.SC); Audio and Speech Processing (eess.AS)
[244]  arXiv:2309.03364 (cross-list from cs.SD) [pdf, other]
Title: Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature
Comments: 5 pages, 3 figures, submitted to ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[245]  arXiv:2309.03378 (cross-list from cs.CL) [pdf, other]
Title: RoDia: A New Dataset for Romanian Dialect Identification from Speech
Comments: Accepted at NAACL 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[246]  arXiv:2309.03404 (cross-list from cs.HC) [pdf, other]
Title: The Role of Communication and Reference Songs in the Mixing Process: Insights from Professional Mix Engineers
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[247]  arXiv:2309.03451 (cross-list from cs.SD) [pdf, other]
Title: Cross-domain Sound Recognition for Efficient Underwater Data Analysis
Comments: Accepted to APSIPA 2023
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[248]  arXiv:2309.03516 (cross-list from cs.SD) [pdf, other]
Title: Topological fingerprints for audio identification
Comments: 26 pages
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Algebraic Topology (math.AT)
[249]  arXiv:2309.03544 (cross-list from cs.SD) [pdf, other]
Title: MVD:A Novel Methodology and Dataset for Acoustic Vehicle Type Classification
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[250]  arXiv:2309.03619 (cross-list from cs.SD) [pdf, other]
Title: Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction
Comments: 13 pages, 5 figures, in submission to MDPI Information
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[251]  arXiv:2309.03641 (cross-list from cs.SD) [pdf, other]
Title: Spiking Structured State Space Model for Monaural Speech Enhancement
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[252]  arXiv:2309.03884 (cross-list from cs.SD) [pdf, other]
Title: Zero-Shot Audio Captioning via Audibility Guidance
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[253]  arXiv:2309.03905 (cross-list from cs.MM) [pdf, other]
Title: ImageBind-LLM: Multi-modality Instruction Tuning
Comments: Code is available at this https URL
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[254]  arXiv:2309.03926 (cross-list from cs.SD) [pdf, other]
Title: Large-Scale Automatic Audiobook Creation
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Digital Libraries (cs.DL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[255]  arXiv:2309.03978 (cross-list from cs.CL) [pdf, other]
Title: LanSER: Language-Model Supported Speech Emotion Recognition
Comments: Presented at INTERSPEECH 2023
Journal-ref: INTERSPEECH (2023) 2408-2412
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[256]  arXiv:2309.04031 (cross-list from cs.CL) [pdf, other]
Title: Multiple Representation Transfer from Large Language Models to End-to-End ASR Systems
Comments: Accepted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[257]  arXiv:2309.04132 (cross-list from cs.SD) [pdf, other]
Title: A Two-Stage Training Framework for Joint Speech Compression and Enhancement
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[258]  arXiv:2309.04156 (cross-list from cs.SD) [pdf, other]
Title: Cross-Utterance Conditioned VAE for Speech Generation
Comments: 13 pages
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[259]  arXiv:2309.04182 (cross-list from cs.SD) [pdf, other]
Title: A Long-Tail Friendly Representation Framework for Artist and Music Similarity
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
[260]  arXiv:2309.04420 (cross-list from cs.SD) [pdf, ps, other]
Title: Parallel and Limited Data Voice Conversion Using Stochastic Variational Deep Kernel Learning
Journal-ref: Engineering Applications of Artificial Intelligence 115 (2022)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[261]  arXiv:2309.04505 (cross-list from cs.SD) [pdf, other]
Title: COVID-19 Detection System: A Comparative Analysis of System Performance Based on Acoustic Features of Cough Audio Signals
Comments: 8 pages, 3 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[262]  arXiv:2309.04509 (cross-list from cs.SD) [pdf, other]
Title: The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion
Comments: ICCV2023
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
[263]  arXiv:2309.04641 (cross-list from cs.SD) [pdf, other]
Title: Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[264]  arXiv:2309.04654 (cross-list from cs.SD) [pdf, other]
Title: Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
Comments: Accepted to EUSIPCO 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[265]  arXiv:2309.04762 (cross-list from cs.SD) [pdf, other]
Title: AudRandAug: Random Image Augmentations for Audio Classification
Comments: Paper has been accepted at the 25th Irish Machine Vision and Image Processing Conference
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[266]  arXiv:2309.04842 (cross-list from cs.CL) [pdf, other]
Title: Leveraging Large Language Models for Exploiting ASR Uncertainty
Comments: Added references
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[267]  arXiv:2309.04861 (cross-list from cs.SD) [pdf, other]
Title: Exploring Music Genre Classification: Algorithm Analysis and Deployment Architecture
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
[268]  arXiv:2309.04946 (cross-list from cs.SD) [pdf, other]
Title: Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation
Comments: Accepted to ICCV 2023. Project page: this https URL
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
[269]  arXiv:2309.05058 (cross-list from cs.SD) [pdf, other]
Title: Multimodal Fish Feeding Intensity Assessment in Aquaculture
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[270]  arXiv:2309.05287 (cross-list from cs.SD) [pdf, other]
Title: Addressing Feature Imbalance in Sound Source Separation
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[271]  arXiv:2309.05353 (cross-list from cs.HC) [pdf, ps, other]
Title: Applied design thinking in urban air mobility: creating the airtaxi cabin design of the future from a user perspective
Comments: 13 pages
Subjects: Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY)
[272]  arXiv:2309.05357 (cross-list from cs.SD) [pdf, other]
Title: EDAC: Efficient Deployment of Audio Classification Models For COVID-19 Detection
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[273]  arXiv:2309.05396 (cross-list from cs.SD) [pdf, other]
Title: SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus
Comments: Accepted by ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[274]  arXiv:2309.05472 (cross-list from cs.CL) [pdf, other]
Title: LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Comments: Published in Computer Science and Language. Preprint allowed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[275]  arXiv:2309.05595 (cross-list from cs.SD) [pdf, ps, other]
Title: Undecidability Results and Their Relevance in Modern Music Making
Authors: Halley Young
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[276]  arXiv:2309.05634 (cross-list from cs.SD) [pdf, other]
Title: Kernel Interpolation of Incident Sound Field in Region Including Scattering Objects
Comments: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[277]  arXiv:2309.05767 (cross-list from cs.SD) [pdf, other]
Title: Natural Language Supervision for General-Purpose Audio Representations
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[278]  arXiv:2309.05843 (cross-list from cs.LG) [pdf, other]
Title: Optimizing Audio Augmentations for Contrastive Learning of Health-Related Acoustic Signals
Comments: 7 pages, 2 pages appendix, 2 figures, 5 appendix tables
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[279]  arXiv:2309.05855 (cross-list from cs.LG) [pdf, other]
Title: Instabilities in Convnets for Raw Audio
Comments: 4 pages, 5 figures, 1 page appendix with mathematical proofs
Journal-ref: IEEE Signal Processing Letters 31 (2024) 1084-1088
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[280]  arXiv:2309.05975 (cross-list from cs.LG) [pdf, other]
Title: CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram
Comments: INTERSPEECH 2023
Journal-ref: Proc. INTERSPEECH 2023, pages 790--794
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[281]  arXiv:2309.06141 (cross-list from cs.SD) [pdf, other]
Title: SynVox2: Towards a privacy-friendly VoxCeleb2 dataset
Comments: conference
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[282]  arXiv:2309.06649 (cross-list from cs.SD) [pdf, other]
Title: Differentiable Modelling of Percussive Audio with Transient and Spectral Synthesis
Comments: To be published in The Proceedings of Forum Acusticum, Sep 2023, Turin, Italy
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[283]  arXiv:2309.06672 (cross-list from cs.SD) [pdf, other]
Title: Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer
Comments: IEEE/ACM Transactions on Audio Speech and Language Processing Under Review
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[284]  arXiv:2309.06723 (cross-list from cs.SD) [pdf, other]
Title: PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
Comments: Interspeech 2023
Journal-ref: Proc. INTERSPEECH 2023, 3719-3723
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[285]  arXiv:2309.06728 (cross-list from cs.CV) [pdf, other]
Title: Leveraging Foundation models for Unsupervised Audio-Visual Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[286]  arXiv:2309.06780 (cross-list from cs.SD) [pdf, other]
Title: Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms
Comments: Submitted to ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[287]  arXiv:2309.06787 (cross-list from cs.SD) [pdf, other]
Title: DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation
Comments: 5 pages, submitted to ICASSP
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[288]  arXiv:2309.06858 (cross-list from cs.SD) [pdf, other]
Title: EMALG: An Enhanced Mandarin Lombard Grid Corpus with Meaningful Sentences
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[289]  arXiv:2309.06981 (cross-list from cs.CR) [pdf, other]
Title: MASTERKEY: Practical Backdoor Attack Against Speaker Verification Systems
Comments: Accepted by Mobicom 2023
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[290]  arXiv:2309.07115 (cross-list from cs.SD) [pdf, other]
Title: Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[291]  arXiv:2309.07195 (cross-list from cs.SD) [pdf, other]
Title: Diffusion models for audio semantic communication
Comments: Submitted to IEEE ICASSP 2024
Subjects: Sound (cs.SD); Emerging Technologies (cs.ET); Audio and Speech Processing (eess.AS)
[292]  arXiv:2309.07314 (cross-list from cs.SD) [pdf, other]
Title: AudioSR: Versatile Audio Super-resolution at Scale
Comments: Under review. Demo and code: this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[293]  arXiv:2309.07391 (cross-list from cs.SD) [pdf, other]
Title: EnCodecMAE: Leveraging neural codecs for universal audio representation learning
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[294]  arXiv:2309.07405 (cross-list from cs.SD) [pdf, other]
Title: FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
Comments: 5 pages, 3 figures, submitted to ICASSP 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[295]  arXiv:2309.07413 (cross-list from cs.CL) [pdf, other]
Title: CPPF: A contextual and post-processing-free model for automatic speech recognition
Comments: Submitted to ICASSP2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[296]  arXiv:2309.07416 (cross-list from cs.SD) [pdf, other]
Title: M3-AUDIODEC: Multi-channel multi-speaker multi-spatial audio codec
Comments: More results and source code are available at this https URL
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[297]  arXiv:2309.07419 (cross-list from cs.SD) [pdf, other]
Title: Mandarin Lombard Flavor Classification
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[298]  arXiv:2309.07432 (cross-list from cs.SD) [pdf, other]
Title: SpatialCodec: Neural Spatial Speech Coding
Comments: Paper in Submission
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[299]  arXiv:2309.07458 (cross-list from cs.SD) [pdf, other]
Title: Analysis of Speech Separation Performance Degradation on Emotional Speech Mixtures
Comments: Accepted by APSIPA ASC 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[300]  arXiv:2309.07478 (cross-list from cs.CL) [pdf, other]
Title: Direct Text to Speech Translation System using Acoustic Units
Comments: 5 pages, 4 figures
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[301]  arXiv:2309.07500 (cross-list from cs.SD) [pdf, other]
Title: Outlier-aware Inlier Modeling and Multi-scale Scoring for Anomalous Sound Detection via Multitask Learning
Comments: accepted at INTERSPEECH 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[302]  arXiv:2309.07525 (cross-list from cs.SD) [pdf, other]
Title: SingFake: Singing Voice Deepfake Detection
Comments: Accepted at ICASSP 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[303]  arXiv:2309.07566 (cross-list from cs.SD) [pdf, other]
Title: Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Comments: 5 pages, 1 figure. submitted to ICASSP 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[304]  arXiv:2309.07598 (cross-list from cs.SD) [pdf, other]
Title: AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion
Comments: Submitted to ICASSP 2024. Demo: this https URL Code: this https URL
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[305]  arXiv:2309.07615 (cross-list from cs.SD) [pdf, other]
Title: Multilingual Audio Captioning using machine translated data
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[306]  arXiv:2309.07658 (cross-list from cs.SD) [pdf, other]
Title: DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[307]  arXiv:2309.07707 (cross-list from cs.CL) [pdf, other]
Title: CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders
Comments: Accepted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[308]  arXiv:2309.07719 (cross-list from cs.CL) [pdf, other]
Title: L1-aware Multilingual Mispronunciation Detection Framework
Comments: 5 pages, submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[309]  arXiv:2309.07733 (cross-list from cs.CL) [pdf, other]
Title: Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Comments: 8 pages
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[310]  arXiv:2309.07739 (cross-list from cs.CL) [pdf, other]
Title: The complementary roles of non-verbal cues for Robust Pronunciation Assessment
Comments: 5 pages, submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[311]  arXiv:2309.07765 (cross-list from cs.SD) [pdf, other]
Title: Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[312]  arXiv:2309.07861 (cross-list from cs.SD) [pdf, other]
Title: CiwaGAN: Articulatory information exchange
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[313]  arXiv:2309.07929 (cross-list from cs.CV) [pdf, other]
Title: Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
Comments: Accepted by AAAI 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[314]  arXiv:2309.07983 (cross-list from cs.CR) [pdf, other]
Title: SLMIA-SR: Speaker-Level Membership Inference Attacks against Speaker Recognition Systems
Comments: In Proceedings of the 31st Network and Distributed System Security (NDSS) Symposium, 2024
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[315]  arXiv:2309.07988 (cross-list from cs.LG) [pdf, other]
Title: Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[316]  arXiv:2309.08027 (cross-list from cs.SD) [pdf, ps, other]
Title: Comparative Assessment of Markov Models and Recurrent Neural Networks for Jazz Music Generation
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[317]  arXiv:2309.08049 (cross-list from cs.SD) [pdf, other]
Title: VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research
Comments: Accepted by OJSP-ICASSP 2024 this https URL
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[318]  arXiv:2309.08051 (cross-list from cs.SD) [pdf, other]
Title: Retrieval-Augmented Text-to-Audio Generation
Comments: Accepted by ICASSP 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[319]  arXiv:2309.08072 (cross-list from cs.SD) [pdf, other]
Title: SSL-Net: A Synergistic Spectral and Learning-based Network for Efficient Bird Sound Classification
Comments: Accepted by IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[320]  arXiv:2309.08087 (cross-list from cs.CV) [pdf, other]
Title: hear-your-action: human action recognition by ultrasound active sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[321]  arXiv:2309.08099 (cross-list from cs.SD) [pdf, other]
Title: Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection
Comments: Submitted to ICASSP 2024
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[322]  arXiv:2309.08108 (cross-list from cs.SD) [pdf, other]
Title: Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting
Comments: Under review
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[323]  arXiv:2309.08127 (cross-list from cs.SD) [pdf, other]
Title: Diversity-based core-set selection for text-to-speech with linguistic and acoustic features
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[324]  arXiv:2309.08144 (cross-list from cs.SD) [pdf, other]
Title: Two-Step Knowledge Distillation for Tiny Speech Enhancement
Comments: Under review ICASSP 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[325]  arXiv:2309.08146 (cross-list from cs.SD) [pdf, other]
Title: Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs
Comments: Winning Solution of IEEE SP Cup at ICASSP 2022
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[326]  arXiv:2309.08150 (cross-list from cs.CL) [pdf, other]
Title: Unimodal Aggregation for CTC-based Speech Recognition
Authors: Ying Fang, Xiaofei Li
Comments: Accepted by ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[327]  arXiv:2309.08166 (cross-list from cs.SD) [pdf, other]
Title: Controllable Residual Speaker Representation for Voice Conversion
Comments: submitted to ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[328]  arXiv:2309.08200 (cross-list from cs.SD) [pdf, other]
Title: TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification
Comments: Accepted by the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[329]  arXiv:2309.08208 (cross-list from cs.SD) [pdf, other]
Title: HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods
Comments: Submitted to 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)
Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[330]  arXiv:2309.08398 (cross-list from cs.SD) [pdf, other]
Title: Exploring Meta Information for Audio-based Zero-shot Bird Classification
Comments: Accepted at ICASSP 2024
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[331]  arXiv:2309.08408 (cross-list from cs.SD) [pdf, other]
Title: Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech
Comments: Submitted to ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[332]  arXiv:2309.08531 (cross-list from cs.CV) [pdf, other]
Title: Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
[333]  arXiv:2309.08535 (cross-list from cs.CV) [pdf, other]
Title: Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper
Comments: Accepted at ICASSP 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[334]  arXiv:2309.08551 (cross-list from cs.CL) [pdf, other]
Title: Augmenting conformers with structured state-space sequence models for online speech recognition
Comments: ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[335]  arXiv:2309.08751 (cross-list from cs.SD) [pdf, other]
Title: Diverse Neural Audio Embeddings -- Bringing Features back !
Authors: Prateek Verma
Comments: 6 pages, 1 figure, 2 tables
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[336]  arXiv:2309.08773 (cross-list from cs.SD) [pdf, other]
Title: Enhance audio generation controllability through representation similarity regularization
Comments: 5 pages
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[337]  arXiv:2309.08837 (cross-list from cs.SD) [pdf, other]
Title: FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework
Comments: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[338]  arXiv:2309.08839 (cross-list from cs.SD) [pdf, other]
Title: Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
Comments: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[339]  arXiv:2309.08971 (cross-list from cs.SD) [pdf, other]
Title: Regularized Contrastive Pre-training for Few-shot Bioacoustic Sound Detection
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[340]  arXiv:2309.09075 (cross-list from cs.SD) [src]
Title: Music Generation based on Generative Adversarial Networks with Transformer
Comments: arXiv admin note: This version has been removed by arXiv administrators due to copyright infringement
Subjects: Sound (cs.SD)
[341]  arXiv:2309.09085 (cross-list from cs.SD) [pdf, other]
Title: SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription
Comments: Accepted to ICASSP 2024
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[342]  arXiv:2309.09088 (cross-list from cs.SD) [pdf, other]
Title: Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[343]  arXiv:2309.09136 (cross-list from cs.SD) [pdf, other]
Title: Enhancing Quantised End-to-End ASR Models via Personalisation
Comments: 5 pages, submitted to ICASSP 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[344]  arXiv:2309.09223 (cross-list from cs.SD) [pdf, other]
Title: Zero- and Few-shot Sound Event Localization and Detection
Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[345]  arXiv:2309.09288 (cross-list from cs.SD) [pdf, other]
Title: Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions
Comments: Accepted in WASPAA 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[346]  arXiv:2309.09329 (cross-list from cs.SD) [pdf, other]
Title: A Few-Shot Approach to Dysarthric Speech Intelligibility Level Classification Using Transformers
Comments: Paper has been presented at ICCCNT 2023 and the final version will be published in IEEE Digital Library Xplore
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[347]  arXiv:2309.09390 (cross-list from cs.CL) [pdf, other]
Title: Augmenting text for spoken language understanding with Large Language Models
Comments: Submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[348]  arXiv:2309.09413 (cross-list from cs.SD) [pdf, other]
Title: Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[349]  arXiv:2309.09469 (cross-list from cs.SD) [pdf, other]
Title: Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks
Comments: Accepted by ICASSP2024
Subjects: Sound (cs.SD); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
[350]  arXiv:2309.09470 (cross-list from cs.SD) [pdf, other]
Title: Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[351]  arXiv:2309.09586 (cross-list from cs.CR) [pdf, ps, other]
Title: Spoofing attack augmentation: can differently-trained attack models improve generalisation?
Comments: Accepted to ICASSP 2024
Subjects: Cryptography and Security (cs.CR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[352]  arXiv:2309.09623 (cross-list from cs.SD) [pdf, other]
Title: HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[353]  arXiv:2309.09627 (cross-list from cs.SD) [pdf, other]
Title: Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders
Comments: Accepted to ICASSP 2024. Demo page: lesterphillip.github.io/icassp2024_el_sie
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[354]  arXiv:2309.09652 (cross-list from cs.SD) [pdf, other]
Title: Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer
Authors: Peter Ochieng
Comments: 10 pages
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[355]  arXiv:2309.09690 (cross-list from cs.CL) [pdf, other]
Title: Do learned speech symbols follow Zipf's law?
Comments: Submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[356]  arXiv:2309.09705 (cross-list from cs.SD) [pdf, other]
Title: Synth-AC: Enhancing Audio Captioning with Synthetic Supervision
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[357]  arXiv:2309.09799 (cross-list from cs.CL) [pdf, other]
Title: Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion Disentanglement
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[358]  arXiv:2309.09837 (cross-list from cs.SD) [pdf, other]
Title: Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection
Subjects: Sound (cs.SD); Computers and Society (cs.CY); Audio and Speech Processing (eess.AS)
[359]  arXiv:2309.09838 (cross-list from cs.CL) [pdf, ps, other]
Title: HypR: A comprehensive study for ASR hypothesis revising with a reference corpus
Comments: Submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[360]  arXiv:2309.09843 (cross-list from cs.CL) [pdf, other]
Title: Instruction-Following Speech Recognition
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[361]  arXiv:2309.10280 (cross-list from cs.SD) [pdf, other]
Title: Crowdotic: A Privacy-Preserving Hospital Waiting Room Crowd Density Estimation with Non-speech Audio
Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[362]  arXiv:2309.10294 (cross-list from cs.CL) [pdf, other]
Title: Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[363]  arXiv:2309.10379 (cross-list from cs.SD) [pdf, ps, other]
Title: PDPCRN: Parallel Dual-Path CRN with Bi-directional Inter-Branch Interactions for Multi-Channel Speech Enhancement
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[364]  arXiv:2309.10393 (cross-list from cs.SD) [pdf, ps, other]
Title: Hierarchical Modeling of Spatial Cues via Spherical Harmonics for Multi-Channel Speech Enhancement
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[365]  arXiv:2309.10439 (cross-list from cs.CV) [pdf, ps, other]
Title: Posterior sampling algorithms for unsupervised speech enhancement with recurrent variational autoencoder
Authors: Mostafa Sadeghi (MULTISPEECH), Romain Serizel (MULTISPEECH)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP); Machine Learning (stat.ML)
[366]  arXiv:2309.10450 (cross-list from cs.CV) [pdf, ps, other]
Title: Unsupervised speech enhancement with diffusion-based generative models
Authors: Berné Nortier (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Romain Serizel (MULTISPEECH)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP); Machine Learning (stat.ML)
[367]  arXiv:2309.10456 (cross-list from cs.SD) [pdf, other]
Title: Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[368]  arXiv:2309.10457 (cross-list from cs.CV) [pdf, ps, other]
Title: Diffusion-based speech enhancement with a weighted generative-supervised learning loss
Authors: Jean-Eudes Ayilo (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Romain Serizel (MULTISPEECH)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP); Machine Learning (stat.ML)
[369]  arXiv:2309.10485 (cross-list from cs.SD) [pdf, other]
Title: A comparative study of Grid and Natural sentences effects on Normal-to-Lombard conversion
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[370]  arXiv:2309.10560 (cross-list from cs.SD) [pdf, other]
Title: Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[371]  arXiv:2309.10567 (cross-list from cs.CL) [pdf, other]
Title: Multimodal Modeling For Spoken Language Identification
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[372]  arXiv:2309.10597 (cross-list from cs.SD) [pdf, other]
Title: Motif-Centric Representation Learning for Symbolic Music
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[373]  arXiv:2309.10667 (cross-list from cs.CV) [pdf, other]
Title: Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Comments: Accepted at BMVC 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[374]  arXiv:2309.10674 (cross-list from cs.SD) [pdf, other]
Title: USED: Universal Speaker Extraction and Diarization
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[375]  arXiv:2309.10719 (cross-list from cs.SD) [pdf, other]
Title: Harmony and Duality: An introduction to Music Theory
Comments: 70 pages, 72 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[376]  arXiv:2309.10724 (cross-list from cs.CV) [pdf, other]
Title: Sound Source Localization is All about Cross-Modal Alignment
Comments: ICCV 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[377]  arXiv:2309.10738 (cross-list from cs.SD) [pdf, other]
Title: MelodyGLM: Multi-task Pre-training for Symbolic Melody Generation
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[378]  arXiv:2309.10740 (cross-list from cs.SD) [pdf, other]
Title: ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[379]  arXiv:2309.10832 (cross-list from cs.SD) [pdf, ps, other]
Title: Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding
Comments: arXiv admin note: text overlap with arXiv:2309.10393
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[380]  arXiv:2309.10926 (cross-list from cs.CL) [pdf, other]
Title: Semi-Autoregressive Streaming ASR With Label Context
Comments: Accepted at ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[381]  arXiv:2309.10930 (cross-list from cs.SD) [pdf, other]
Title: Test-Time Training for Speech
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[382]  arXiv:2309.10993 (cross-list from cs.SD) [pdf, other]
Title: Directional Source Separation for Robust Speech Recognition on Smart Glasses
Comments: Submitted to ICASSP 2024
Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
[383]  arXiv:2309.11000 (cross-list from cs.CL) [pdf, other]
Title: Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[384]  arXiv:2309.11140 (cross-list from cs.SD) [pdf, other]
Title: Investigating Personalization Methods in Text to Music Generation
Comments: Submitted to ICASSP 2024, Examples at this https URL
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[385]  arXiv:2309.11218 (cross-list from cs.CV) [pdf, other]
Title: Automatic Bat Call Classification using Transformer Networks
Comments: Volume 78, December 2023, 102288
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[386]  arXiv:2309.11379 (cross-list from cs.CL) [pdf, other]
Title: Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff
Comments: Accepted at INTERSPEECH 2023
Journal-ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[387]  arXiv:2309.11384 (cross-list from cs.CL) [pdf, ps, other]
Title: Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[388]  arXiv:2309.11462 (cross-list from cs.CR) [pdf, other]
Title: AudioFool: Fast, Universal and synchronization-free Cross-Domain Attack on Speech Recognition
Comments: 10 pages, 11 Figures
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[389]  arXiv:2309.11500 (cross-list from cs.SD) [pdf, other]
Title: A Large-scale Dataset for Audio-Language Representation Learning
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[390]  arXiv:2309.11725 (cross-list from cs.SD) [pdf, other]
Title: FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency
Comments: Submitted to ICASSP'2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[391]  arXiv:2309.11783 (cross-list from cs.HC) [pdf, other]
Title: Frame Pairwise Distance Loss for Weakly-supervised Sound Event Detection
Comments: Submitted to ICASSP 2024
Subjects: Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[392]  arXiv:2309.11845 (cross-list from cs.SD) [pdf, other]
Title: TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
Comments: This work has been accepted by ACM MM 2023 for publication
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[393]  arXiv:2309.11849 (cross-list from cs.SD) [pdf, other]
Title: A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis
Comments: ChinaMM 2023
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[394]  arXiv:2309.11895 (cross-list from cs.SD) [pdf, other]
Title: Audio Contrastive based Fine-tuning
Comments: Under review
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[395]  arXiv:2309.11977 (cross-list from cs.SD) [pdf, other]
Title: Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
Comments: Accepted by ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[396]  arXiv:2309.12111 (cross-list from cs.SD) [pdf, other]
Title: Passage Summarization with Recurrent Models for Audio-Sheet Music Retrieval
Comments: In Proceedings of the 24th Conference of the International Society for Music Information Retrieval (ISMIR 2023), Milan, Italy
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[397]  arXiv:2309.12134 (cross-list from cs.SD) [pdf, other]
Title: Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems
Journal-ref: Proceedings of the 14th ACM Multimedia Systems Conference (MMSys '23), June 7-10, 2023, Vancouver, BC, Canada
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[398]  arXiv:2309.12158 (cross-list from cs.SD) [pdf, other]
Title: Towards Robust and Truly Large-Scale Audio-Sheet Music Retrieval
Comments: Proceedings of the IEEE 6th International Conference on Multimedia Information Processing and Retrieval (MIPR)
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[399]  arXiv:2309.12234 (cross-list from cs.CL) [pdf, ps, other]
Title: Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition
Comments: Submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[400]  arXiv:2309.12237 (cross-list from cs.CR) [pdf, other]
Title: t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators
Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence. For associated codes, see this https URL (Github) and this https URL (Google Colab)
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Computation (stat.CO)
[401]  arXiv:2309.12242 (cross-list from cs.SD) [pdf, other]
Title: Weakly-supervised Automated Audio Captioning via text only training
Comments: DCASE Workshop 2023
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[402]  arXiv:2309.12254 (cross-list from cs.ET) [pdf, other]
Title: Variational Quantum Harmonizer: Generating Chord Progressions and Other Sonification Methods with the VQE Algorithm
Comments: Manuscript Accepted to the 2nd International Symposium on Quantum Computing and Musical Creativity (ISQCMC Berlin). Link: this https URL
Subjects: Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS); Quantum Physics (quant-ph)
[403]  arXiv:2309.12283 (cross-list from cs.SD) [pdf, other]
Title: Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis
Comments: 5 pages, project page available at benadar293.github.io/midipm
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[404]  arXiv:2309.12306 (cross-list from cs.CV) [pdf, other]
Title: TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[405]  arXiv:2309.12521 (cross-list from cs.SD) [pdf, other]
Title: Profile-Error-Tolerant Target-Speaker Voice Activity Detection
Comments: Submission for ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[406]  arXiv:2309.12672 (cross-list from cs.SD) [pdf, other]
Title: CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers
Comments: Accepted by ASRU2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[407]  arXiv:2309.12802 (cross-list from cs.SD) [pdf, other]
Title: Deepfake audio as a data augmentation technique for training automatic speech to text transcription models
Comments: 9 pages, 6 figures, 7 tables
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[408]  arXiv:2309.13085 (cross-list from cs.SD) [pdf, other]
Title: Does My Dog "Speak" Like Me? The Acoustic Correlation between Pet Dogs and Their Human Owners
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[409]  arXiv:2309.13086 (cross-list from cs.SD) [pdf, other]
Title: Towards Lexical Analysis of Dog Vocalizations via Online Videos
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[410]  arXiv:2309.13166 (cross-list from cs.SD) [pdf, other]
Title: Invisible Watermarking for Audio Generation Diffusion Models
Comments: This is an invited paper for IEEE TPS, part of the IEEE CIC/CogMI/TPS 2023 conference
Subjects: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[411]  arXiv:2309.13227 (cross-list from cs.LG) [pdf, other]
Title: Importance of negative sampling in weak label learning
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[412]  arXiv:2309.13259 (cross-list from cs.IR) [pdf, other]
Title: WikiMT++ Dataset Card
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[413]  arXiv:2309.13292 (cross-list from cs.LG) [pdf, other]
Title: Beyond Fairness: Age-Harmless Parkinson's Detection via Voice
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[414]  arXiv:2309.13343 (cross-list from cs.SD) [pdf, other]
Title: Two vs. Four-Channel Sound Event Localization and Detection
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[415]  arXiv:2309.13347 (cross-list from cs.CL) [pdf, other]
Title: My Science Tutor (MyST) -- A Large Corpus of Children's Conversational Speech
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[416]  arXiv:2309.13373 (cross-list from cs.SD) [pdf, other]
Title: Asca: less audio data is more insightful
Comments: 6 pages, 3 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[417]  arXiv:2309.13476 (cross-list from cs.CL) [pdf, other]
Title: Hierarchical attention interpretation: an interpretable speech-level transformer for bi-modal depression detection
Comments: 5 pages, 3 figures, submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[418]  arXiv:2309.13509 (cross-list from cs.SD) [pdf, other]
Title: Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control
Comments: Submitted to ASRU2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[419]  arXiv:2309.13544 (cross-list from cs.IR) [pdf, ps, other]
Title: Related Rhythms: Recommendation System To Discover Music You May Like
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[420]  arXiv:2309.13573 (cross-list from cs.SD) [pdf, other]
Title: The second multi-channel multi-party meeting transcription challenge (M2MeT 2.0): A benchmark for speaker-attributed ASR
Comments: 8 pages, Accepted by ASRU2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[421]  arXiv:2309.13860 (cross-list from cs.CL) [pdf, other]
Title: Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[422]  arXiv:2309.13876 (cross-list from cs.CL) [pdf, other]
Title: Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Comments: Accepted at ASRU 2023
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[423]  arXiv:2309.13907 (cross-list from cs.SD) [pdf, other]
Title: HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS
Comments: Accepted by ASRU2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[424]  arXiv:2309.13920 (cross-list from cs.SD) [pdf, ps, other]
Title: Real-Time Emergency Vehicle Detection using Mel Spectrograms and Regular Expressions
Comments: in Spanish language
Subjects: Sound (cs.SD); Formal Languages and Automata Theory (cs.FL); Symbolic Computation (cs.SC); Audio and Speech Processing (eess.AS)
[425]  arXiv:2309.13942 (cross-list from cs.CV) [pdf, other]
Title: Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Comments: Published at the CVPR 2023 Sight and Sound workshop
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[426]  arXiv:2309.13972 (cross-list from cs.SD) [pdf, ps, other]
Title: Audio classification with Dilated Convolution with Learnable Spacings
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[427]  arXiv:2309.14094 (cross-list from cs.SD) [pdf, other]
Title: VoiceLens: Controllable Speaker Generation and Editing with Flow
Authors: Yao Shi, Ming Li
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[428]  arXiv:2309.14130 (cross-list from cs.SD) [pdf, ps, other]
Title: On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers
Comments: accepted at ICASSP 2024
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[429]  arXiv:2309.14149 (cross-list from cs.SD) [pdf, other]
Title: Multi-Domain Adaptation by Self-Supervised Learning for Speaker Verification
Comments: submitted to ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[430]  arXiv:2309.14158 (cross-list from cs.SD) [pdf, other]
Title: An Investigation of Distribution Alignment in Multi-Genre Speaker Recognition
Comments: submitted to ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[431]  arXiv:2309.14372 (cross-list from cs.CL) [pdf, other]
Title: Human Transcription Quality Improvement
Comments: 5 pages, 3 figures, 5 tables, INTERSPEECH 2023
Journal-ref: INTERSPEECH 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[432]  arXiv:2309.14383 (cross-list from cs.SD) [pdf, ps, other]
Title: Towards using Cough for Respiratory Disease Diagnosis by leveraging Artificial Intelligence: A Survey
Comments: 30 pages, 12 figures, 9 tables
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[433]  arXiv:2309.14398 (cross-list from cs.LG) [pdf, other]
Title: Seeing and hearing what has not been said; A multimodal client behavior classifier in Motivational Interviewing with interpretable fusion
Comments: 9 pages, 7 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[434]  arXiv:2309.14405 (cross-list from cs.SD) [pdf, other]
Title: Joint Audio and Speech Understanding
Comments: Accepted at ASRU 2023. Code, dataset, and pretrained models are at this https URL. Interactive demo at this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[435]  arXiv:2309.14586 (cross-list from cs.SD) [pdf, other]
Title: Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer
Comments: MICCAI 2023 (Oral presentation)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[436]  arXiv:2309.14838 (cross-list from cs.SD) [pdf, other]
Title: Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification
Comments: Accepted by ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[437]  arXiv:2309.15013 (cross-list from cs.CL) [pdf, other]
Title: Updated Corpora and Benchmarks for Long-Form Speech Recognition
Comments: Submitted to ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[438]  arXiv:2309.15024 (cross-list from cs.SD) [pdf, other]
Title: Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[439]  arXiv:2309.15087 (cross-list from cs.CR) [pdf, other]
Title: Privacy-preserving and Privacy-attacking Approaches for Speech and Audio -- A Survey
Subjects: Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
[440]  arXiv:2309.15223 (cross-list from cs.CL) [pdf, other]
Title: Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages
Journal-ref: Proc. IEEE ASRU Workshop, Dec. 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[441]  arXiv:2309.15317 (cross-list from cs.CL) [pdf, other]
Title: Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning
Comments: Accepted to ASRU 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[442]  arXiv:2309.15512 (cross-list from cs.SD) [pdf, other]
Title: High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models
Comments: Accepted by ICASSP 2024. arXiv admin note: substantial text overlap with arXiv:2307.15484; text overlap with arXiv:2309.00424
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[443]  arXiv:2309.15554 (cross-list from cs.CL) [pdf, other]
Title: Direct Models for Simultaneous Translation and Automatic Subtitling: FBK@IWSLT2023
Comments: Published at IWSLT 2023
Journal-ref: Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[444]  arXiv:2309.15649 (cross-list from cs.CL) [pdf, other]
Title: Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting
Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version
Journal-ref: Proc. IEEE ASRU Workshop, Dec. 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[445]  arXiv:2309.15674 (cross-list from cs.SD) [pdf, other]
Title: Speech collage: code-switched audio generation by collaging monolingual corpora
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[446]  arXiv:2309.15686 (cross-list from cs.CL) [pdf, other]
Title: Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[447]  arXiv:2309.15701 (cross-list from cs.CL) [pdf, other]
Title: HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[448]  arXiv:2309.15800 (cross-list from cs.CL) [pdf, other]
Title: Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
Comments: Submitted to IEEE ICASSP 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[449]  arXiv:2309.15826 (cross-list from cs.CL) [pdf, other]
Title: Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[450]  arXiv:2309.15869 (cross-list from cs.CL) [pdf, other]
Title: Unsupervised Pre-Training for Vietnamese Automatic Speech Recognition in the HYKIST Project
Authors: Khai Le-Duc
Comments: Bachelor Thesis
Journal-ref: FH Aachen University of Applied Sciences (2023)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[451]  arXiv:2309.15977 (cross-list from cs.SD) [pdf, other]
Title: Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[452]  arXiv:2309.16178 (cross-list from cs.SD) [pdf, other]
Title: LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR
Comments: Accepted to IEEE ASRU 2023
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[453]  arXiv:2309.16265 (cross-list from cs.SD) [pdf, other]
Title: Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
Comments: 5 pages, 3 figures. Accepted by ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[454]  arXiv:2309.16284 (cross-list from cs.SD) [pdf, other]
Title: NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment
Comments: Accepted for ICASSP 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[455]  arXiv:2309.16287 (cross-list from cs.SD) [pdf, other]
Title: Predicting performance difficulty from piano sheet music images
Subjects: Sound (cs.SD); Digital Libraries (cs.DL); Audio and Speech Processing (eess.AS)
[456]  arXiv:2309.16308 (cross-list from cs.MM) [pdf, other]
Title: Audio Visual Speaker Localization from EgoCentric Views
Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[457]  arXiv:2309.16369 (cross-list from cs.SD) [pdf, other]
Title: Bringing the Discussion of Minima Sharpness to the Audio Domain: a Filter-Normalised Evaluation for Acoustic Scene Classification
Comments: This work has been submitted to the IEEE for possible publication
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[458]  arXiv:2309.16418 (cross-list from cs.SD) [pdf, other]
Title: Efficient Supervised Training of Audio Transformers for Music Representation Learning
Comments: Accepted at the 2023 International Society for Music Information Retrieval Conference (ISMIR'23)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[459]  arXiv:2309.16569 (cross-list from cs.SD) [pdf, other]
Title: Audio-Visual Speaker Verification via Joint Cross-Attention
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[460]  arXiv:2309.16937 (cross-list from cs.CL) [pdf, other]
Title: SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition
Comments: 5 pages, 2 figures. Accepted by ICME 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[461]  arXiv:2309.17056 (cross-list from cs.SD) [pdf, other]
Title: ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech
Comments: Accepted at ICASSP2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[462]  arXiv:2309.17125 (cross-list from cs.LG) [pdf, other]
Title: Style Transfer for Non-differentiable Audio Effects
Authors: Kieran Grant
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[463]  arXiv:2309.17189 (cross-list from cs.SD) [pdf, other]
Title: RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
Comments: Accepted by The Twelfth International Conference on Learning Representations (ICLR) 2024
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
[464]  arXiv:2309.17352 (cross-list from cs.SD) [pdf, other]
Title: Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
Comments: ICASSP 2024 camera-ready paper. Winner of the DCASE 2023 Challenge Task 6A: Automated Audio Captioning (AAC)
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[465]  arXiv:2309.17395 (cross-list from cs.LG) [pdf, other]
Title: AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Comments: Under review
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)