Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech

Bae, Jae-Sung; Bak, Tae-Jun; Joo, Young-Sun; Cho, Hoon-Young

(Submitted on 29 Jun 2021)

Abstract: In this paper, we propose methods for improving the modeling performance of a Transformer-based non-autoregressive text-to-speech (TNA-TTS) model. Although the text encoder and audio decoder handle different types and lengths of data (i.e., text and audio), TNA-TTS models are not designed with these differences in mind. Therefore, to improve the modeling performance of the TNA-TTS model, we propose a hierarchical Transformer structure-based text encoder and audio decoder, each designed to accommodate the characteristics of its module. For the text encoder, we constrain each self-attention layer so that the encoder focuses on the text sequence from the local to the global scope. Conversely, the audio decoder constrains its self-attention layers to focus in the reverse direction, i.e., from the global to the local scope. Additionally, we further improve the pitch modeling accuracy of the audio decoder by providing sentence- and word-level pitch as conditions. Various objective and subjective evaluations verified that the proposed method outperformed the baseline TNA-TTS.
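
The local-to-global constraint described in the abstract can be illustrated with per-layer attention masks. The abstract does not say how the constraint is implemented, so the sketch below assumes a simple banded (windowed) mask whose width changes from layer to layer. The function names and the window schedule are hypothetical and are not taken from the paper.

```python
def local_attention_mask(seq_len, window):
    """Build a seq_len x seq_len mask; True means position i may attend to j.

    window is the maximum allowed distance |i - j|.
    window=None means unrestricted (global) attention.
    """
    if window is None:
        return [[True] * seq_len for _ in range(seq_len)]
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


def hierarchical_masks(seq_len, windows):
    """Return one mask per self-attention layer, in layer order."""
    return [local_attention_mask(seq_len, w) for w in windows]


# Hypothetical window schedule (an assumption, not the paper's setting).
ENCODER_WINDOWS = [1, 2, 4, None]                   # local -> global
DECODER_WINDOWS = list(reversed(ENCODER_WINDOWS))   # global -> local

encoder_masks = hierarchical_masks(6, ENCODER_WINDOWS)
decoder_masks = hierarchical_masks(6, DECODER_WINDOWS)

# Number of positions visible from position 0 in each layer.
encoder_scope = [sum(mask[0]) for mask in encoder_masks]
decoder_scope = [sum(mask[0]) for mask in decoder_masks]
```

In this sketch the visible scope grows layer by layer in the encoder and shrinks in the decoder. In a real model, each mask would be applied to that layer's attention scores before the softmax.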

Comments:	Accepted to INTERSPEECH 2021
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2106.15144 [eess.AS]
	(or arXiv:2106.15144v1 [eess.AS] for this version)

Submission history

From: Jaesung Bae
[v1] Tue, 29 Jun 2021 08:05:11 GMT (699kb,D)
