AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Shi, Yao; Bu, Hui; Xu, Xin; Zhang, Shaoji; Li, Ming

(Submitted on 22 Oct 2020 (v1), last revised 22 Apr 2021 (this version, v2))

Abstract: In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers. Auxiliary speaker attributes such as gender, age group and native accent are explicitly marked and provided in the corpus. Transcripts at both the Chinese character level and the pinyin level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The system is an extension of Tacotron-2 in which a speaker verification model and a corresponding voice similarity loss are incorporated as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well to speakers that are never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker embedding similarity and equal error rate. The dataset, baseline system code and generated samples are available online.
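
The abstract names two objective measures, speaker embedding similarity and equal error rate (EER), without saying how they are computed. The sketch below shows one common way to compute each, assuming cosine similarity between fixed-length speaker embeddings and a simple threshold sweep for the EER. The function names, the toy vectors and the toy scores are illustrative assumptions and are not taken from the paper or its released code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(genuine, impostor):
    """Approximate EER from same-speaker and different-speaker scores.

    Every observed score is tried as a decision threshold; the result is
    the mean of the false acceptance and false rejection rates at the
    threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(genuine) | set(impostor)):
        # Impostor trials accepted at this threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # Genuine trials rejected at this threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical embeddings: a reference speaker and a synthesized utterance.
reference = [0.2, 0.9, 0.1]
synthesized = [0.25, 0.85, 0.05]
similarity = cosine_similarity(reference, synthesized)

# Hypothetical verification scores for same-speaker and cross-speaker trials.
eer = equal_error_rate(genuine=[0.9, 0.4], impostor=[0.6, 0.1])
```

A higher cosine similarity and a lower EER both indicate that synthesized speech is harder to tell apart from the target speaker, which is the sense in which the abstract reports "high voice similarity".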

Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2010.11567 [cs.SD]
	(or arXiv:2010.11567v2 [cs.SD] for this version)

Submission history

From: Yao Shi
[v1] Thu, 22 Oct 2020 09:54:22 GMT (986kb,D)
[v2] Thu, 22 Apr 2021 07:51:51 GMT (986kb,D)
