Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

Wang, Yutian; Xie, Yuankun; Zhao, Kun; Wang, Hui; Zhang, Qin

(Submitted on 7 Apr 2022)

Abstract: In this paper, we propose a novel prosody disentanglement method for prosodic Text-to-Speech (TTS) models, which introduces vector quantization (VQ) into the auxiliary prosody encoder to obtain decomposed prosody representations in an unsupervised manner. Relying on its advantages, speaking styles, such as pitch, speaking rate, and local pitch variance, are decomposed automatically into the latent quantized vectors. We also investigate the internal mechanism of the VQ disentanglement process by means of a latent variable counter and find that higher-value dimensions usually represent prosody information. Experiments show that our model can control the speaking styles of synthesized speech by directly manipulating the latent variables. Objective and subjective evaluations show that our model outperforms popular models.
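
As a rough illustration of the vector quantization step the abstract refers to, the following minimal sketch maps each encoder output vector to its nearest codebook entry and counts how often each entry is selected. It is not the authors' implementation: the function name `quantize`, the toy two-dimensional codebook, and the sample frames are all assumptions made for illustration, and the index counter is only loosely analogous to the latent variable counter mentioned above.

```python
from collections import Counter


def quantize(frames, codebook):
    """Return, for each frame, the index of its nearest codebook vector."""
    indices = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook vector.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        indices.append(dists.index(min(dists)))
    return indices


# Hypothetical codebook and prosody-encoder outputs (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [1.1, 0.8]]

indices = quantize(frames, codebook)        # discrete latent codes
quantized = [codebook[i] for i in indices]  # quantized representations
usage = Counter(indices)                    # how often each code is chosen
```

Under these assumptions, swapping an entry of `indices` for a different code before decoding would be the simplest analogue of manipulating the latent variables directly.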

Comments:	accepted by IEEE International Conference on Multimedia and Expo 2022 (ICME2022)
Subjects:	Audio and Speech Processing (eess.AS); Multimedia (cs.MM)
Cite as:	arXiv:2204.03238 [eess.AS]
	(or arXiv:2204.03238v1 [eess.AS] for this version)

Submission history

From: Yutian Wang
[v1] Thu, 7 Apr 2022 06:09:47 GMT (1936kb,D)
