Anatomy of Industrial Scale Multilingual ASR

Ramirez, Francis McCann; Chkhetiani, Luka; Ehrenberg, Andrew; McHardy, Robert; Botros, Rami; Khare, Yash; Vanzo, Andrea; Peyash, Taufiquzzaman; Oexle, Gabriel; Liang, Michael; Sklyar, Ilya; Fakhan, Enver; Etefy, Ahmed; McCrystal, Daniel; Flamini, Sam; Donato, Domenic; Yoshioka, Takuya

Authors: Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang, Ilya Sklyar, Enver Fakhan, Ahmed Etefy, Daniel McCrystal, Sam Flamini, Domenic Donato, Takuya Yoshioka

(Submitted on 15 Apr 2024 (v1), last revised 16 Apr 2024 (this version, v2))

Abstract: This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
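The abstract reports results as word error rates (WERs). As a point of reference, WER is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation code used in the paper. The function name `word_error_rate` is hypothetical, and the whitespace tokenization with no text normalization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are produced by whitespace splitting, with no casing or
    punctuation normalization. Real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one way hallucinated output shows up in this metric.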

Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2404.09841 [eess.AS]
	(or arXiv:2404.09841v2 [eess.AS] for this version)

Submission history

From: Luka Chkhetiani
[v1] Mon, 15 Apr 2024 14:48:43 GMT (1549kb,D)
[v2] Tue, 16 Apr 2024 14:55:13 GMT (1545kb,D)
