Electrical Engineering and Systems Science > Audio and Speech Processing
Title: Fine-grained style control in Transformer-based Text-to-speech Synthesis
(Submitted on 12 Oct 2021 (v1), last revised 16 Mar 2022 (this version, v2))
Abstract: In this paper, we present a novel architecture for fine-grained style control in Transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our cross-attention blocks, which fuse and align content and style. Because the fusion is performed alongside a skip connection, the cross-attention block provides a good inductive bias for gradually infusing the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating the LST during training and by using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
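The two mechanisms named in the abstract, cross-attention fusion with a skip connection and random truncation of the LST, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cross_attention_fuse` and `random_truncate` are hypothetical, the attention here is single-head with no learned projections, and truncation is assumed to keep a random contiguous window (the abstract does not say how the truncation is done).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(content, style):
    """Fuse style into content with scaled dot-product cross-attention.

    content: list of T_c phoneme vectors (used as queries).
    style:   list of T_s style-token vectors (used as keys and values).
    Returns T_c vectors: content + attention(content, style, style).
    Assumption: no learned projections, single head.
    """
    dim = len(content[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in content:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in style]
        weights = softmax(scores)
        context = [
            sum(w * value[j] for w, value in zip(weights, style))
            for j in range(dim)
        ]
        # Skip connection: style is added on top of the phoneme vector.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


def random_truncate(style, rng, min_len=1):
    """Keep a random contiguous window of the style-token sequence.

    Assumption: window-based truncation; the source only says
    "randomly truncating LST during training".
    """
    length = rng.randint(min_len, len(style))
    start = rng.randint(0, len(style) - length)
    return style[start:start + length]


if __name__ == "__main__":
    rng = random.Random(0)
    content = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    style = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4], [0.0, 0.9]]
    fused = cross_attention_fuse(content, random_truncate(style, rng))
    print(len(fused), len(fused[0]))
```

Because the style context is added to the phoneme vector rather than replacing it, the output always has the length of the content sequence, whatever the length of the (truncated) style sequence. Stacking several such blocks would inject style gradually, which is the inductive bias the abstract describes.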
Submission history
From: Li-Wei Chen
[v1] Tue, 12 Oct 2021 19:50:02 GMT (1830kb,D)
[v2] Wed, 16 Mar 2022 20:46:46 GMT (1843kb,D)