Title: Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

Abstract: In this paper, we propose a novel prosody disentanglement method for prosodic Text-to-Speech (TTS) models, which introduces vector quantization (VQ) into the auxiliary prosody encoder to obtain decomposed prosody representations in an unsupervised manner. Owing to its advantages, speaking styles such as pitch, speaking rate, and local pitch variance are decomposed automatically into the latent quantized vectors. We also investigate the internal mechanism of the VQ disentanglement process by means of a latent variable counter and find that higher-value dimensions usually represent prosody information. Experiments show that our model can control the speaking style of the synthesized speech by directly manipulating the latent variables. Objective and subjective evaluations show that our model outperforms popular models.
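
The vector quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the codebook values, the frame vectors, and the helper names `quantize` and `count_code_usage` are all hypothetical. The sketch only shows the generic VQ idea of snapping a continuous vector to its nearest codebook entry, plus one possible reading of a "latent variable counter" as a tally of how often each code is selected.

```python
from collections import Counter


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Nearness is squared Euclidean distance, the usual choice in VQ.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def count_code_usage(vectors, codebook):
    """Tally how often each codebook index is selected (hypothetical counter)."""
    return Counter(quantize(v, codebook)[0] for v in vectors)


# Toy data, assumed for illustration only: a 3-entry codebook of 2-D codes
# and four continuous "prosody encoder outputs".
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [1.1, 0.8]]

usage = count_code_usage(frames, codebook)
print(usage)
```

In a trained model the codebook would be learned jointly with the encoder; here it is fixed by hand so the lookup can be followed step by step.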
Comments: accepted by IEEE International Conference on Multimedia and Expo 2022 (ICME2022)
Subjects: Audio and Speech Processing (eess.AS); Multimedia (cs.MM)
Cite as: arXiv:2204.03238 [eess.AS]
  (or arXiv:2204.03238v1 [eess.AS] for this version)

Submission history

From: Yutian Wang
[v1] Thu, 7 Apr 2022 06:09:47 GMT (1936kb,D)
