CAMP: a Two-Stage Approach to Modelling Prosody in Context

Hodari, Zack; Moinet, Alexis; Karlapati, Sri; Lorenzo-Trueba, Jaime; Merritt, Thomas; Joly, Arnaud; Abbas, Ammar; Karanasou, Penny; Drugman, Thomas

(Submitted on 2 Nov 2020 (v1), last revised 12 Feb 2021 (this version, v2))

Abstract: Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate than other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both of these issues. To mitigate the challenge of modelling a slowly varying signal, we learn to disentangle prosodic information using a word-level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.

Comments:	5 pages. Published in the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2011.01175 [eess.AS]
	(or arXiv:2011.01175v2 [eess.AS] for this version)

Submission history

From: Zack Hodari
[v1] Mon, 2 Nov 2020 18:14:57 GMT (171kb,D)
[v2] Fri, 12 Feb 2021 11:27:42 GMT (455kb,D)
