Very Deep Self-Attention Networks for End-to-End Speech Recognition

Pham, Ngoc-Quan; Nguyen, Thai-Son; Niehues, Jan; Müller, Markus; Stüker, Sebastian; Waibel, Alexander

(Submitted on 30 Apr 2019 (v1), last revised 3 May 2019 (this version, v2))

Abstract: Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed the performance of previous end-to-end approaches and even match conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both the encoder and the decoder, combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieves 9.9% and 17.7% WER on the Switchboard and CallHome test sets, respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling, the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure.
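
The results above are reported as word error rate (WER). As a minimal sketch of how that metric is commonly computed, assuming the standard definition (word-level edit distance divided by the number of reference words), the following may help. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the scoring pipeline used in the paper, which is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real ASR scoring tools usually normalize the text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 9.9% corresponds to roughly one word error for every ten reference words.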

Comments:	Submitted to INTERSPEECH 2019
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:1904.13377 [cs.CL]
	(or arXiv:1904.13377v2 [cs.CL] for this version)

Submission history

From: Ngoc Quan Pham
[v1] Tue, 30 Apr 2019 17:20:32 GMT (186kb,D)
[v2] Fri, 3 May 2019 14:00:16 GMT (186kb,D)