Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning

Zhu, Xinfa; Li, Yuke; Lei, Yi; Jiang, Ning; Zhao, Guoqing; Xie, Lei

(Submitted on 26 Oct 2023 (this version), latest version 25 Apr 2024 (v2))

Abstract: This paper aims to build an expressive TTS system for multiple speakers that synthesizes a target speaker's speech in multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach that transfers style and emotion across speakers. Specifically, we construct positive-negative sample pairs at both the utterance level and the category level (for example, emotion "happy", style "poet", or speaker A) and leverage contrastive learning to better extract disentangled style, emotion, and speaker representations from speech. Furthermore, we introduce a semi-supervised training strategy that lets the proposed approach effectively leverage multi-domain data, including style-labeled data, emotion-labeled data, and unlabeled data. We integrate the learned representations into an improved VITS model, enabling it to synthesize expressive speech with diverse styles and emotions for a target speaker. Experiments on multi-domain data demonstrate the effectiveness of our model design.
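The pairing idea in the abstract can be illustrated with a small, generic InfoNCE-style loss. This is only a sketch under stated assumptions and not the authors' implementation: the function name `contrastive_loss`, the cosine similarity, the temperature value, and the convention of marking a missing label with `None` are all hypothetical choices made for illustration. Items that share a group id are treated as positives and all other labeled items as negatives. Passing utterance ids gives an utterance-level loss that is available for every sample, while passing category labels gives a category-level loss that simply skips unlabeled samples, which is one plausible way to mix labeled and unlabeled data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, groups, temperature=0.1):
    """InfoNCE-style loss averaged over anchors that have a positive.

    groups[i] is a hashable id, or None when the label is unknown.
    Items sharing an id are positives, items with different ids are
    negatives, and items whose id is None are ignored entirely
    (an assumption of this sketch, not taken from the paper).
    """
    known = [i for i, g in enumerate(groups) if g is not None]
    losses = []
    for i in known:
        positives = [j for j in known if j != i and groups[j] == groups[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        others = [j for j in known if j != i]
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in others
        }
        log_denom = math.log(sum(math.exp(s) for s in logits.values()))
        # Negative mean log-probability of picking a positive for anchor i.
        anchor_loss = -sum(logits[j] - log_denom for j in positives)
        losses.append(anchor_loss / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy data: two segments from each of three utterances (hypothetical values).
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7], [0.6, 0.8]]
utterance_ids = ["u1", "u1", "u2", "u2", "u3", "u3"]    # known for all data
emotion_labels = ["happy", "happy", "sad", "sad", None, None]  # u3 unlabeled

utterance_level = contrastive_loss(emb, utterance_ids)
category_level = contrastive_loss(emb, emotion_labels)
total = utterance_level + category_level
```

In this toy setup the unlabeled utterance `u3` still contributes to the utterance-level term, while the category-level term is computed only from the four emotion-labeled segments. How the two terms are actually weighted and combined in the paper is not stated in the abstract, so the plain sum here is an assumption.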

Comments:	5 pages, 3 figures
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2310.17101 [eess.AS]
	(or arXiv:2310.17101v1 [eess.AS] for this version)

Submission history

From: Xinfa Zhu [view email]
[v1] Thu, 26 Oct 2023 01:58:38 GMT (2171kb,D)
[v2] Thu, 25 Apr 2024 14:41:55 GMT (2660kb,D)
