DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Ling, Shaoshi; Liu, Yuzong

(Submitted on 11 Dec 2020)

Abstract: Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition models. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to learn a feature representation. A smaller amount of labeled data is then used to train a downstream ASR system on the new feature representations. Building on our previous work DeCoAR and drawing inspiration from other speech representation learning methods, we propose DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We introduce several modifications over DeCoAR: first, we use Transformers in the encoding module instead of LSTMs; second, we introduce a vector quantization layer between the encoder and reconstruction modules; third, we propose an objective that combines the reconstruction loss with a vector quantization diversity loss to train speech representations. Our experiments show consistent improvements over other speech representations in different data-sparse scenarios. Without fine-tuning, a lightweight ASR model trained on 10 hours of LibriSpeech labeled data with DeCoAR 2.0 features outperforms the model trained on the full 960-hour dataset with filterbank features.
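
The combined objective described in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation, and it rests on assumptions the abstract does not state: the diversity term follows the wav2vec 2.0 form (negative entropy of the codebook-usage distribution averaged over a batch, for G codebook groups of V entries each), the reconstruction term is an L1 loss over masked frames, and the weight `alpha` and all function names are hypothetical.

```python
import math


def softmax(logits):
    # Numerically stable softmax over one group's codebook logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def diversity_loss(logits_per_frame, num_groups, num_entries):
    """Assumed wav2vec 2.0-style diversity loss.

    logits_per_frame: list of frames; each frame is a list of G lists of V logits.
    Returns (1 / (G * V)) * sum_g sum_v p_bar[g][v] * log p_bar[g][v],
    i.e. the averaged negative entropy, which is lowest when all codebook
    entries are used equally often across the batch.
    """
    G, V = num_groups, num_entries
    n = len(logits_per_frame)
    total_neg_entropy = 0.0
    for g in range(G):
        # Average the softmax distribution for group g over all frames.
        avg = [0.0] * V
        for frame in logits_per_frame:
            probs = softmax(frame[g])
            for v in range(V):
                avg[v] += probs[v] / n
        total_neg_entropy += sum(p * math.log(p) for p in avg if p > 0)
    return total_neg_entropy / (G * V)


def l1_reconstruction_loss(predicted, target, mask):
    # Assumed L1 loss, averaged over feature dimensions of masked frames only.
    total, count = 0.0, 0
    for pred_frame, tgt_frame, masked in zip(predicted, target, mask):
        if masked:
            total += sum(abs(p - t) for p, t in zip(pred_frame, tgt_frame))
            count += len(tgt_frame)
    return total / count if count else 0.0


def decoar2_style_loss(predicted, target, mask, logits, G, V, alpha=0.1):
    # Hypothetical combination: reconstruction loss + alpha * diversity loss.
    return (l1_reconstruction_loss(predicted, target, mask)
            + alpha * diversity_loss(logits, G, V))
```

With uniform logits every codebook entry is used equally, so the diversity term reaches its minimum of -log(V)/V; if every frame selects the same entry, the term rises toward 0. Minimizing the sum therefore pushes the quantizer to spread its usage across the codebook while the reconstruction term keeps the representation informative.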

Comments:	Submitted to ICASSP 2021
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2012.06659 [eess.AS]
	(or arXiv:2012.06659v1 [eess.AS] for this version)

Submission history

From: Yuzong Liu
[v1] Fri, 11 Dec 2020 22:07:23 GMT (1421kb,D)