VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

Lu, Junchen; Sisman, Berrak; Liu, Rui; Zhang, Mingyang; Li, Haizhou

(Submitted on 7 Oct 2021 (v1), last revised 2 Mar 2022 (this version, v3))

Abstract: In this paper, we formulate a novel task, denoted automatic voice over (AVO), which is to synthesize speech in sync with a silent pre-recorded video. Unlike traditional speech synthesis, AVO seeks to generate not only human-sounding speech but also perfect lip-speech synchronization. A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video. We propose a novel text-to-speech model conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization. VisualTTS adopts two novel mechanisms, 1) textual-visual attention and 2) a visual fusion strategy during acoustic decoding, both of which contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence. Experimental results show that VisualTTS achieves accurate lip-speech synchronization and outperforms all baseline systems.

Comments: To appear at ICASSP 2022
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Cite as: arXiv:2110.03342 [eess.AS] (or arXiv:2110.03342v3 [eess.AS] for this version)

Submission history

From: Junchen Lu
[v1] Thu, 7 Oct 2021 11:25:25 GMT (1464kb,D)
[v2] Sat, 9 Oct 2021 12:03:35 GMT (1464kb,D)
[v3] Wed, 2 Mar 2022 13:55:36 GMT (1477kb,D)
