Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR

Zhu, Han; Wang, Li; Wang, Jindong; Cheng, Gaofeng; Zhang, Pengyuan; Yan, Yonghong

(Submitted on 9 Oct 2021 (v1), last revised 17 Jun 2022 (this version, v2))

Abstract: Self-supervised pre-training can effectively improve the performance of low-resource automatic speech recognition (ASR). However, existing self-supervised pre-training methods are task-agnostic, i.e., they can be applied to various downstream tasks. Although this broadens their scope of application, the capacity of the pre-trained model is not fully utilized for the ASR task, and the learned representations may not be optimal for ASR. In this work, to build a better pre-trained model for low-resource ASR, we propose a pre-training approach called wav2vec-S, in which task-specific semi-supervised pre-training refines the self-supervised pre-trained model for the ASR task, thereby using the capacity of the pre-trained model more effectively to generate task-specific representations for ASR. Experiments show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase in pre-training time but significantly improves ASR performance on in-domain, cross-domain, and cross-lingual datasets. Average relative word error rate (WER) reductions are 24.5% and 6.6% for 1h and 10h fine-tuning, respectively. Furthermore, we show through canonical correlation analysis that semi-supervised pre-training closes the representation gap between the self-supervised pre-trained model and the corresponding fine-tuned model.
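
For readers unfamiliar with the metric, the following is a minimal sketch of how an average relative WER reduction could be computed. It assumes the average is the plain mean of per-dataset relative reductions, which the abstract does not state. The function names and the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def average_relative_reduction(pairs):
    """Mean of per-dataset relative reductions (an assumed averaging scheme)."""
    reductions = [relative_wer_reduction(b, n) for b, n in pairs]
    return sum(reductions) / len(reductions)


# Hypothetical (baseline WER, improved WER) pairs in percent, NOT paper results.
example_pairs = [(20.0, 15.0), (10.0, 9.0)]
average = average_relative_reduction(example_pairs)
```

Note that a relative reduction is measured against the baseline WER, so a drop from 20.0 to 15.0 counts as 25%, not 5 percentage points.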

Comments:	Accepted by Interspeech 2022
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2110.04484 [eess.AS]
	(or arXiv:2110.04484v2 [eess.AS] for this version)

Submission history

From: Han Zhu
[v1] Sat, 9 Oct 2021 07:09:22 GMT (162kb,D)
[v2] Fri, 17 Jun 2022 08:07:33 GMT (240kb,D)
