Computer Science > Machine Learning
Title: Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech
(Submitted on 28 Nov 2019 (v1), last revised 17 Feb 2020 (this version, v2))
Abstract: We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while also improving perceptual metrics over the state of the art. At synthesis time, we use one example of expressive style as a reference input to the encoder to generate any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).
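For readers unfamiliar with the Householder Flow named in the abstract, the following is a minimal NumPy sketch of the general technique, not the authors' implementation. It assumes a diagonal-Gaussian VAE posterior and a standard-normal prior; the function names, latent size, batch size, and number of flow steps are all hypothetical choices made for illustration. Each flow step reflects the latent sample across a hyperplane, which is an orthogonal (volume-preserving) transform.

```python
import numpy as np


def sample_posterior(mu, log_var, rng):
    """Reparameterized sample z0 = mu + sigma * eps from a diagonal Gaussian."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def householder_flow(z0, vs):
    """Apply a sequence of Householder reflections to latent samples.

    z0: array of shape (batch, dim).
    vs: array of shape (num_steps, dim), one reflection vector per step
        (in a trained model these would be predicted by the encoder;
        here they are random, which is an assumption of this sketch).
    """
    z = z0
    for v in vs:
        # H z = z - 2 v (v^T z) / ||v||^2, computed without forming H.
        z = z - 2.0 * np.outer(z @ v, v) / (v @ v)
    return z


def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), one value per batch element."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


# Hypothetical sizes: batch of 4, latent dimension 8, 3 flow steps.
rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 8))
log_var = 0.1 * rng.standard_normal((4, 8))
vs = rng.standard_normal((3, 8))

z0 = sample_posterior(mu, log_var, rng)
zT = householder_flow(z0, vs)
kl = kl_diag_gaussian(mu, log_var)
```

Because each reflection is orthogonal, `zT` has the same Euclidean norm as `z0` for every sample, and the flow contributes no log-determinant term to the training objective.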
Submission history
From: Vatsal Aggarwal
[v1] Thu, 28 Nov 2019 15:57:14 GMT (988kb,D)
[v2] Mon, 17 Feb 2020 13:56:04 GMT (988kb,D)