High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

Sheng, Leyuan; Huang, Dong-Yan; Pavlovskiy, Evgeniy N.

(Submitted on 3 Dec 2019)

Abstract: In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, the generated spectrograms are over-smoothed and therefore cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post-filter that combines Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. The resulting super-resolution spectrogram networks generate enhanced spectrograms that produce high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01, over baseline results of 3.29 and 3.84, when using the Griffin-Lim and WaveNet vocoders, respectively.
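
To put the reported scores in perspective, the short sketch below computes the absolute and relative MOS gain for each vocoder. It uses only the four MOS values quoted in the abstract; the names `reported_mos` and `mos_gain` are illustrative and do not come from the paper.

```python
# MOS values as quoted in the abstract: (baseline, proposed) per vocoder.
reported_mos = {
    "Griffin-Lim": (3.29, 3.71),
    "WaveNet": (3.84, 4.01),
}


def mos_gain(baseline, proposed):
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = proposed - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


# Hypothetical helper structure: vocoder name -> (absolute, relative %) gain.
gains = {vocoder: mos_gain(*scores) for vocoder, scores in reported_mos.items()}

for vocoder, (absolute, relative) in gains.items():
    print(f"{vocoder}: +{absolute:.2f} MOS ({relative:.1f}% over baseline)")
```

On these figures the gain is larger with Griffin-Lim (+0.42) than with WaveNet (+0.17). The abstract reports only these four scores, so the sketch says nothing about their statistical significance.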

Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:1912.01167 [eess.AS]
	(or arXiv:1912.01167v1 [eess.AS] for this version)

Submission history

From: Leyuan Sheng
[v1] Tue, 3 Dec 2019 02:53:54 GMT (1462kb,D)
