Electrical Engineering and Systems Science > Audio and Speech Processing
Title: Karaoker: Alignment-free singing voice synthesis with speech training data
(Submitted on 8 Apr 2022 (v1), last revised 31 Aug 2022 (this version, v2))
Abstract: Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned, through a single deep convolutional encoder, on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In addition to multitasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
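The abstract names a Wasserstein GAN training scheme and a multitask objective but does not give their exact form. As a rough illustration only, the standard Wasserstein critic and generator losses, together with a weighted sum of task losses, can be sketched as below. The function names, the task names, and the weights are all hypothetical and are not taken from the paper.

```python
from statistics import fmean
from typing import Mapping, Sequence


def critic_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Standard Wasserstein critic loss.

    Minimizing it increases the gap between the mean critic score on real
    samples and the mean critic score on generated samples.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores: Sequence[float]) -> float:
    """Standard Wasserstein generator loss: raise the critic's score on generated samples."""
    return -fmean(fake_scores)


def combined_loss(task_losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-task losses.

    The weighting scheme is an assumption made for illustration; the abstract
    does not say how the individual objectives are combined.
    """
    return sum(weights[name] * value for name, value in task_losses.items())


if __name__ == "__main__":
    # Toy critic scores, invented for the example.
    real = [1.0, 3.0]
    fake = [0.0, 1.0]
    print(critic_loss(real, fake))
    print(generator_loss(fake))
    print(combined_loss({"tts": 2.0, "adversarial": generator_loss(fake)},
                        {"tts": 1.0, "adversarial": 0.5}))
```

In practice a Wasserstein critic also needs a Lipschitz constraint (weight clipping or a gradient penalty), which this sketch leaves out.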
Submission history
From: Panos Kakoulidis
[v1] Fri, 8 Apr 2022 15:33:59 GMT (155kb,D)
[v2] Wed, 31 Aug 2022 08:44:07 GMT (156kb,D)