Chunked Autoregressive GAN for Conditional Waveform Synthesis

Morrison, Max; Kumar, Rithesh; Kumar, Kundan; Seetharaman, Prem; Courville, Aaron; Bengio, Yoshua

Electrical Engineering and Systems Science > Audio and Speech Processing

(Submitted on 19 Oct 2021 (v1), last revised 3 Mar 2022 (this version, v2))

Abstract: Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond to the generator's inability to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of-the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN), reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.
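
The idea of sampling large chunks autoregressively can be illustrated with a toy sketch. This is not the CARGAN architecture: the generator below is a hypothetical hand-written stand-in that continues a sinusoid, and the names `toy_generator` and `generate_chunked`, the sample rate, the chunk size, and the context size are all assumptions made for illustration. It only shows the control flow (each forward pass emits a whole chunk and receives the tail of the previously generated waveform) and why that context helps: the previous samples carry the phase needed to continue a periodic signal without a discontinuity.

```python
import math

SAMPLE_RATE = 16000   # assumed sample rate in Hz
CHUNK_SIZE = 2048     # hypothetical number of samples generated per forward pass
CONTEXT_SIZE = 512    # hypothetical number of previous samples fed back in


def toy_generator(pitch_hz, context, chunk_size=CHUNK_SIZE):
    """Stand-in for a trained generator (NOT the CARGAN network).

    Continues a unit-amplitude sinusoid at pitch_hz, recovering the current
    phase from the last two context samples. Assumes the context was produced
    at the same pitch.
    """
    omega = 2.0 * math.pi * pitch_hz / SAMPLE_RATE  # phase step per sample
    prev2, prev1 = context[-2], context[-1]
    if prev1 == 0.0 and prev2 == 0.0:
        phase = 0.0  # silent context: start from zero phase
    else:
        # prev1 = sin(phase) and prev2 = sin(phase - omega), so solve for cos(phase)
        cos_phase = (prev1 * math.cos(omega) - prev2) / math.sin(omega)
        phase = math.atan2(prev1, cos_phase)
    return [math.sin(phase + omega * (n + 1)) for n in range(chunk_size)]


def generate_chunked(pitch_per_chunk, generator=toy_generator):
    """Generate a waveform chunk by chunk, feeding earlier output back in."""
    waveform = [0.0] * CONTEXT_SIZE  # zero-padded initial context
    for pitch_hz in pitch_per_chunk:
        context = waveform[-CONTEXT_SIZE:]  # tail of what was generated so far
        waveform.extend(generator(pitch_hz, context))
    return waveform[CONTEXT_SIZE:]  # drop the initial padding


# Four chunks at a constant 220 Hz pitch, used here as toy conditioning.
audio = generate_chunked([220.0] * 4)
```

In this toy setting the chunks join without a phase jump because each call reads the phase from its context. A generator that saw only the conditioning would have no such information at chunk boundaries.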

Comments:	Published as a conference paper at ICLR 2022
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2110.10139 [eess.AS]
	(or arXiv:2110.10139v2 [eess.AS] for this version)

Submission history

From: Max Morrison [view email]
[v1] Tue, 19 Oct 2021 17:48:12 GMT (3250kb,D)
[v2] Thu, 3 Mar 2022 23:05:26 GMT (4444kb,D)
