EfficientSpeech: An On-Device Text to Speech Model

Atienza, Rowel

Title: EfficientSpeech: An On-Device Text to Speech Model

Authors: Rowel Atienza

(Submitted on 23 May 2023)

Abstract: State-of-the-art (SOTA) neural text-to-speech (TTS) models can generate natural-sounding synthetic voices. These models are characterized by large memory footprints and a substantial number of operations, owing to the long-standing focus on speech quality with cloud inference in mind. Neural TTS models are generally not designed to perform standalone speech synthesis on resource-constrained edge devices that have no Internet access. This work proposes EfficientSpeech, an efficient neural TTS model that synthesizes speech on an ARM CPU in real time. EfficientSpeech uses a shallow, non-autoregressive, pyramid-structure transformer forming a U-Network. EfficientSpeech has 266k parameters and consumes only 90 MFLOPS, about 1% of the size and amount of computation of modern compact models such as Mixer-TTS. EfficientSpeech achieves an average mel generation real-time factor of 104.3 on an RPi4. Human evaluation shows only a slight degradation in audio quality compared to FastSpeech2.
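
To make the real-time factor figure concrete, here is a minimal Python sketch. It assumes that the real-time factor is the duration of the generated audio divided by the time taken to generate it, so values above 1 mean faster than real time; the abstract does not spell out the definition, and some works use the inverse. The function names, the hop length of 256 samples, and the 22,050 Hz sample rate are illustrative assumptions, not values taken from the paper.

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of compute (assumed convention)."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def mel_audio_seconds(num_frames: int, hop_length: int, sample_rate: int) -> float:
    """Approximate audio duration covered by a mel spectrogram.

    Each mel frame advances the signal by hop_length samples.
    """
    return num_frames * hop_length / sample_rate


def synthesis_seconds_at(audio_seconds: float, rtf: float) -> float:
    """Compute time needed for a given audio duration at a given real-time factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


if __name__ == "__main__":
    # Hypothetical utterance: 862 mel frames, hop length 256, 22,050 Hz audio.
    duration = mel_audio_seconds(862, hop_length=256, sample_rate=22050)
    # 104.3 is the average mel generation real-time factor reported in the abstract.
    needed = synthesis_seconds_at(duration, rtf=104.3)
    print(f"{duration:.2f} s of audio would need about {needed * 1000:.0f} ms of mel generation")
```

Under this convention, a real-time factor of 104.3 means that roughly ten seconds of speech would need a little under a tenth of a second of mel generation time on the device.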

Comments:	To be presented at ICASSP 2023
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2305.13905 [eess.AS]
	(or arXiv:2305.13905v1 [eess.AS] for this version)

Submission history

From: Rowel Atienza
[v1] Tue, 23 May 2023 10:28:41 GMT (146kb,D)
