GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Huang, Rongjie; Ren, Yi; Liu, Jinglin; Cui, Chenye; Zhao, Zhou

(Submitted on 15 May 2022 (v1), last revised 12 Oct 2022 (this version, v2))

Abstract: Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at this https URL

Comments:	Accepted to NeurIPS 2022
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2205.07211 [eess.AS]
	(or arXiv:2205.07211v2 [eess.AS] for this version)

Submission history

From: Rongjie Huang
[v1] Sun, 15 May 2022 08:16:02 GMT (6605kb,D)
[v2] Wed, 12 Oct 2022 13:59:20 GMT (7714kb,D)
