Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

Deng, Keqi; Yang, Zehui; Watanabe, Shinji; Higuchi, Yosuke; Cheng, Gaofeng; Zhang, Pengyuan

(Submitted on 25 Jan 2022 (v1), last revised 26 Jan 2022 (this version, v2))

Abstract: While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure is a bottleneck for speeding up decoding. Real-world deployment requires ASR systems that are both highly accurate and fast at inference. Non-autoregressive (NAR) models have become a popular alternative because of their fast inference, but they still fall behind AR systems in recognition accuracy. To meet both demands, we propose an NAR CTC/attention model that uses both a pre-trained acoustic model and a pre-trained language model: wav2vec2.0 and BERT. To bridge the modality gap between the speech and text representations obtained from the pre-trained models, we design a novel modality conversion mechanism, which is more suitable for logographic languages. During inference, we employ a CTC branch to generate a target length, which enables BERT to predict tokens in parallel. We also design a cache-based CTC/attention joint decoding method to improve recognition accuracy while keeping decoding fast. Experimental results show that the proposed NAR model greatly outperforms our strong wav2vec2.0 CTC baseline (15.1% relative character error rate (CER) reduction on AISHELL-1). The proposed NAR model significantly surpasses previous NAR systems on the AISHELL-1 benchmark and shows potential for English tasks.
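
The abstract's idea of using a CTC branch to produce a target length, so that all output positions can then be predicted in parallel, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain greedy CTC decoding (merge consecutive repeats, then drop blanks), and the names `BLANK_ID`, `ctc_greedy_collapse`, and `estimate_target_length`, as well as the example frame ids, are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = BLANK_ID) -> List[int]:
    """Apply the standard CTC collapse rule to per-frame argmax ids."""
    # Step 1: merge runs of identical consecutive ids into one id.
    merged = [token for token, _ in groupby(frame_ids)]
    # Step 2: remove blank symbols.
    return [token for token in merged if token != blank_id]


def estimate_target_length(frame_ids: Sequence[int], blank_id: int = BLANK_ID) -> int:
    """Number of tokens left after collapsing, used here as the target length."""
    return len(ctc_greedy_collapse(frame_ids, blank_id))


# Hypothetical per-frame argmax ids from a CTC output layer.
frames = [0, 7, 7, 0, 0, 3, 3, 3, 0, 7, 0]
tokens = ctc_greedy_collapse(frames)
length = estimate_target_length(frames)
print(tokens, length)
```

With the length known up front, a non-autoregressive decoder can allocate that many output slots and fill them in a single parallel pass instead of emitting one token at a time.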

Comments:	Accepted by ICASSP2022
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2201.10103 [eess.AS]
	(or arXiv:2201.10103v2 [eess.AS] for this version)

Submission history

From: Keqi Deng
[v1] Tue, 25 Jan 2022 05:40:55 GMT (1379kb,D)
[v2] Wed, 26 Jan 2022 06:36:20 GMT (1379kb,D)
