Sound
New submissions
[ showing up to 1000 entries per page: fewer | more ]
New submissions for Tue, 23 Apr 24
- [1] arXiv:2404.13286 [pdf, other]
-
Title: Track Role Prediction of Single-Instrumental SequencesComments: ISMIR LBD 2023Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
In the composition process, selecting appropriate single-instrumental music sequences and assigning their track-role is an indispensable task. However, manually determining the track-role for a myriad of music samples can be time-consuming and labor-intensive. This study introduces a deep learning model designed to automatically predict the track-role of single-instrumental music sequences. Our evaluations show a prediction accuracy of 87% in the symbolic domain and 84% in the audio domain. The proposed track-role prediction methods hold promise for future applications in AI music generation and analysis.
- [2] arXiv:2404.13358 [pdf, other]
-
Title: Music Consistency ModelsSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Consistency models have exhibited remarkable capabilities in facilitating efficient image/video generation, enabling synthesis with minimal sampling steps. It has proven to be advantageous in mitigating the computational burdens associated with diffusion models. Nevertheless, the application of consistency models in music generation remains largely unexplored. To address this gap, we present Music Consistency Models (\texttt{MusicCM}), which leverages the concept of consistency models to efficiently synthesize mel-spectrogram for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, the \texttt{MusicCM} model incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results reveal the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notable, \texttt{MusicCM} achieves seamless music synthesis with a mere four sampling steps, e.g., only one second per minute of the music clip, showcasing the potential for real-time application.
- [3] arXiv:2404.13428 [pdf, ps, other]
-
Title: Text-dependent Speaker Verification (TdSV) Challenge 2024: Challenge Evaluation PlanSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
This document outlines the Text-dependent Speaker Verification (TdSV) Challenge 2024, which centers on analyzing and exploring novel approaches for text-dependent speaker verification. The primary goal of this challenge is to motive participants to develop single yet competitive systems, conduct thorough analyses, and explore innovative concepts such as multi-task learning, self-supervised learning, few-shot learning, and others, for text-dependent speaker verification.
- [4] arXiv:2404.13509 [pdf, ps, other]
-
Title: MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative AttentionComments: Main paper (5 pages). Accepted for publication by ICME 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate Hubert features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve 2.6\% and 1.87\% improvements on the weighted accuracy and unweighted accuracy, respectively. Extensive experiments demonstrate the effectiveness of the proposed method.
- [5] arXiv:2404.13551 [pdf, other]
-
Title: AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognitionSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recent research has successfully adapted vision-based convolutional neural network (CNN) architectures for audio recognition tasks using Mel-Spectrograms. However, these CNNs have high computational costs and memory requirements, limiting their deployment on low-end edge devices. Motivated by the success of efficient vision models like InceptionNeXt and ConvNeXt, we propose AudioRepInceptionNeXt, a single-stream architecture. Its basic building block breaks down the parallel multi-branch depth-wise convolutions with descending scales of k x k kernels into a cascade of two multi-branch depth-wise convolutions. The first multi-branch consists of parallel multi-scale 1 x k depth-wise convolutional layers followed by a similar multi-branch employing parallel multi-scale k x 1 depth-wise convolutional layers. This reduces computational and memory footprint while separating time and frequency processing of Mel-Spectrograms. The large kernels capture global frequencies and long activities, while small kernels get local frequencies and short activities. We also reparameterize the multi-branch design during inference to further boost speed without losing accuracy. Experiments show that AudioRepInceptionNeXt reduces parameters and computations by 50%+ and improves inference speed 1.28x over state-of-the-art CNNs like the Slow-Fast while maintaining comparable accuracy. It also learns robustly across a variety of audio recognition tasks. Codes are available at https://github.com/StevenLauHKHK/AudioRepInceptionNeXt.
- [6] arXiv:2404.13568 [pdf, ps, other]
-
Title: Sparse Direction of Arrival Estimation Method Based on Vector Signal Reconstruction with a Single Vector SensorAuthors: Jiabin GuoComments: 20 pagesSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This study investigates the application of single vector hydrophones in underwater acoustic signal processing for Direction of Arrival (DOA) estimation. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this research proposes a Vector Signal Reconstruction (VSR) technique. This technique transforms the covariance matrix of single vector hydrophone signals into a Toeplitz structure suitable for gridless sparse methods through complex calculations and vector signal reconstruction. Furthermore, two sparse DOA estimation algorithms based on vector signal reconstruction are introduced. Theoretical analysis and simulation experiments demonstrate that the proposed algorithms significantly improve the accuracy and resolution of DOA estimation in multi-source signals and low Signal-to-Noise Ratio (SNR) environments compared to traditional algorithms. The contribution of this study lies in providing an effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions and solutions in the field of vector hydrophone signal processing.
- [7] arXiv:2404.13569 [pdf, other]
-
Title: Musical Word Embedding for Music Tagging and RetrievalComments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Word embedding has become an essential means for text-based information retrieval. Typically, word embeddings are learned from large quantities of general and unstructured text data. However, in the domain of music, the word embedding may have difficulty understanding musical contexts or recognizing music-related entities like artists and tracks. To address this issue, we propose a new approach called Musical Word Embedding (MWE), which involves learning from various types of texts, including both everyday and music-related vocabulary. We integrate MWE into an audio-word joint representation framework for tagging and retrieving music, using words like tag, artist, and track that have different levels of musical specificity. Our experiments show that using a more specific musical word like track results in better retrieval performance, while using a less specific term like tag leads to better tagging performance. To balance this compromise, we suggest multi-prototype training that uses words with different levels of musical specificity jointly. We evaluate both word embedding and audio-word joint embedding on four tasks (tag rank prediction, music tagging, query-by-tag, and query-by-track) across two datasets (Million Song Dataset and MTG-Jamendo). Our findings show that the suggested MWE is more efficient and robust than the conventional word embedding.
- [8] arXiv:2404.13789 [pdf, other]
-
Title: Anchor-aware Deep Metric Learning for Audio-visual RetrievalComments: 9 pages, 5 figures. Accepted by ACM ICMR 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Metric learning minimizes the gap between similar (positive) pairs of data points and increases the separation of dissimilar (negative) pairs, aiming at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling methods to select impactful data points from the embedding space during training. However, the model training fails to fully explore the space due to the scarcity of training data points, resulting in an incomplete representation of the overall positive and negative distributions. In this paper, we propose an innovative Anchor-aware Deep Metric Learning (AADML) method to address this challenge by uncovering the underlying correlations among existing data points, which enhances the quality of the shared embedding space. Specifically, our method establishes a correlation graph-based manifold structure by considering the dependencies between each sample as the anchor and its semantically similar samples. Through dynamic weighting of the correlations within this underlying manifold structure using an attention-driven mechanism, Anchor Awareness (AA) scores are obtained for each anchor. These AA scores serve as data proxies to compute relative distances in metric learning approaches. Extensive experiments conducted on two audio-visual benchmark datasets demonstrate the effectiveness of our proposed AADML method, significantly surpassing state-of-the-art models. Furthermore, we investigate the integration of AA proxies with various metric learning methods, further highlighting the efficacy of our approach.
- [9] arXiv:2404.13892 [pdf, other]
-
Title: Retrieval-Augmented Audio Deepfake DetectionComments: Accepted by the 2024 International Conference on Multimedia Retrieval (ICMR 2024)Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.
- [10] arXiv:2404.13914 [pdf, other]
-
Title: Audio Anti-Spoofing Detection: A SurveyComments: submitted to ACM Computing SurveysSubjects: Sound (cs.SD); Cryptography and Security (cs.CR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
The availability of smart devices leads to an exponential increase in multimedia content. However, the rapid advancements in deep learning have given rise to sophisticated algorithms capable of manipulating or creating multimedia fake content, known as Deepfake. Audio Deepfakes pose a significant threat by producing highly realistic voices, thus facilitating the spread of misinformation. To address this issue, numerous audio anti-spoofing detection challenges have been organized to foster the development of anti-spoofing countermeasures. This survey paper presents a comprehensive review of every component within the detection pipeline, including algorithm architectures, optimization techniques, application generalizability, evaluation metrics, performance comparisons, available datasets, and open-source availability. For each aspect, we conduct a systematic evaluation of the recent advancements, along with discussions on existing challenges. Additionally, we also explore emerging research topics on audio anti-spoofing, including partial spoofing detection, cross-dataset evaluation, and adversarial attack defence, while proposing some promising research directions for future work. This survey paper not only identifies the current state-of-the-art to establish strong baselines for future experiments but also guides future researchers on a clear path for understanding and enhancing the audio anti-spoofing detection mechanisms.
- [11] arXiv:2404.14063 [pdf, other]
-
Title: LVNS-RAVE: Diversified audio generation with RAVE and Latent Vector Novelty SearchComments: Accepted to GECCO 24 CompanionSubjects: Sound (cs.SD); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
Evolutionary Algorithms and Generative Deep Learning have been two of the most powerful tools for sound generation tasks. However, they have limitations: Evolutionary Algorithms require complicated designs, posing challenges in control and achieving realistic sound generation. Generative Deep Learning models often copy from the dataset and lack creativity. In this paper, we propose LVNS-RAVE, a method to combine Evolutionary Algorithms and Generative Deep Learning to produce realistic and novel sounds. We use the RAVE model as the sound generator and the VGGish model as a novelty evaluator in the Latent Vector Novelty Search (LVNS) algorithm. The reported experiments show that the method can successfully generate diversified, novel audio samples under different mutation setups using different pre-trained RAVE models. The characteristics of the generation process can be easily controlled with the mutation parameters. The proposed algorithm can be a creative tool for sound artists and musicians.
Cross-lists for Tue, 23 Apr 24
- [12] arXiv:2404.13140 (cross-list from quant-ph) [pdf, ps, other]
-
Title: Intro to Quantum Harmony: Chords in SuperpositionSubjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Correlations between quantum theory and music theory - specifically between principles of quantum computing and musical harmony - can lead to new understandings and new methodologies for music theorists and composers. The quantum principle of superposition is shown to be closely related to different interpretations of musical meaning. Superposition is implemented directly in the authors' simulations of quantum computing, as applied in the decision-making processes of computer-generated music composition.
- [13] arXiv:2404.13289 (cross-list from cs.CL) [pdf, other]
-
Title: Double Mixture: Towards Continual Event Detection from SpeechAuthors: Jingqi Kang, Tongtong Wu, Jinming Zhao, Guitao Wang, Yinwei Wei, Hao Yang, Guilin Qi, Yuan-Fang Li, Gholamreza HaffariComments: The first two authors contributed equally to this workSubjects: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing. Our approach achieves the lowest rates of forgetting and the highest levels of generalization, proving robust across various continual learning sequences. Our code and data are available at https://anonymous.4open.science/status/Continual-SpeechED-6461.
- [14] arXiv:2404.13821 (cross-list from cs.HC) [pdf, other]
-
Title: Robotic Blended Sonification: Consequential Robot Sound as Creative Material for Human-Robot InteractionComments: Paper accepted at ISEA 24, The 29th International Symposium on Electronic Art, Brisbane, Australia, 21-29 June 2024Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Current research in robotic sounds generally focuses on either masking the consequential sound produced by the robot or on sonifying data about the robot to create a synthetic robot sound. We propose to capture, modify, and utilise rather than mask the sounds that robots are already producing. In short, this approach relies on capturing a robot's sounds, processing them according to contextual information (e.g., collaborators' proximity or particular work sequences), and playing back the modified sound. Previous research indicates the usefulness of non-semantic, and even mechanical, sounds as a communication tool for conveying robotic affect and function. Adding to this, this paper presents a novel approach which makes two key contributions: (1) a technique for real-time capture and processing of consequential robot sounds, and (2) an approach to explore these sounds through direct human-robot interaction. Drawing on methodologies from design, human-robot interaction, and creative practice, the resulting 'Robotic Blended Sonification' is a concept which transforms the consequential robot sounds into a creative material that can be explored artistically and within application-based studies.
Replacements for Tue, 23 Apr 24
- [15] arXiv:2309.03641 (replaced) [pdf, other]
-
Title: Spiking Structured State Space Model for Monaural Speech EnhancementSubjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
- [16] arXiv:2401.17738 (replaced) [pdf, other]
-
Title: Harnessing Smartwatch Microphone Sensors for Cough Detection and ClassificationComments: 7 pagesSubjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- [17] arXiv:2404.04739 (replaced) [pdf, other]
-
Title: Mathematics of the MML functional quantizer modules for VCV Rack software synthesizerAuthors: Maxwell Schneider, Cody McCarthy, Michael G. Maxwell, Joshua Pfeffer, Maxwell Schneider, Robert Schneider, Andrew V. SillsComments: 4 pages, submitted for publicationSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); History and Overview (math.HO)
- [18] arXiv:2404.09466 (replaced) [pdf, other]
-
Title: Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano TranscriptionComments: Fixed TyposSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- [19] arXiv:2305.12708 (replaced) [pdf, other]
-
Title: ViT-TTS: Visual Text-to-Speech with Scalable Diffusion TransformerAuthors: Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou ZhaoComments: Accepted by EMNLP 2023Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
- [20] arXiv:2403.04661 (replaced) [pdf, other]
-
Title: Dynamic Cross Attention for Audio-Visual Person VerificationComments: Accepted to FG2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
- [21] arXiv:2403.10488 (replaced) [pdf, other]
-
Title: Joint Multimodal Transformer for Emotion Recognition in the WildAuthors: Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric GrangerComments: 10 pages, 4 figures, 6 tables, CVPRw 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
- [22] arXiv:2403.10518 (replaced) [pdf, other]
-
Title: Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance PrimitivesAuthors: Ronghui Li, YuXiang Zhang, Yachao Zhang, Hongwen Zhang, Jie Guo, Yan Zhang, Yebin Liu, Xiu LiComments: Accepted by CVPR2024, Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
- [23] arXiv:2403.16973 (replaced) [pdf, other]
-
Title: VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the WildComments: Data, code, and model weights are available at this https URLSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
- [24] arXiv:2404.02000 (replaced) [pdf, ps, other]
-
Title: Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan ContextComments: To appear in AfricaNLP 2024Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[ showing up to 1000 entries per page: fewer | more ]
Disable MathJax (What is MathJax?)
Links to: arXiv, form interface, find, cs, recent, 2404, contact, help (Access key information)