Electrical Engineering and Systems Science > Audio and Speech Processing
Title: VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
(Submitted on 15 Jul 2021)
Abstract: In this work, we propose novel decoding algorithms that enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD). The algorithms are based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding algorithm that combines the efficiency of batched output-synchronous search with the low latency of input-synchronous search. We also propose a VAD-free inference algorithm that uses CTC probabilities to decide when to reset the model states, addressing the model's vulnerability to long-form data. Experimental evaluations demonstrate that block-synchronous decoding achieves accuracy comparable to label-synchronous decoding. Moreover, VAD-free inference can robustly recognize long-form speech lasting up to a few hours.
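The idea of using CTC probabilities to time state resets can be illustrated with a minimal sketch. The abstract does not state the actual reset criterion, so the rule below is an assumption chosen for illustration: propose a reset once the CTC blank posterior has stayed above a threshold for a fixed number of consecutive frames. The function name `find_reset_points` and the parameters `blank_threshold` and `min_blank_frames` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def find_reset_points(
    blank_probs: Sequence[float],
    blank_threshold: float = 0.9,   # assumed value, not from the paper
    min_blank_frames: int = 5,      # assumed value, not from the paper
) -> List[int]:
    """Return frame indices at which a streaming decoder could reset its states.

    Assumed rule (illustrative only): a reset is proposed at the frame where
    the run of consecutive frames with CTC blank posterior >= blank_threshold
    first reaches min_blank_frames. Each blank run yields at most one reset.
    """
    reset_points: List[int] = []
    run_length = 0  # number of consecutive blank-dominant frames seen so far
    for t, p_blank in enumerate(blank_probs):
        if p_blank >= blank_threshold:
            run_length += 1
            # Fire exactly once, when the run first becomes long enough.
            if run_length == min_blank_frames:
                reset_points.append(t)
        else:
            # A non-blank frame ends the current run.
            run_length = 0
    return reset_points


if __name__ == "__main__":
    # Per-frame blank posteriors for a toy stream with two pauses.
    probs = [0.1, 0.95, 0.95, 0.95, 0.2, 0.99, 0.99, 0.99, 0.99]
    print(find_reset_points(probs, blank_threshold=0.9, min_blank_frames=3))
```

In a real streaming system the blank posteriors would arrive block by block from the CTC branch of the encoder, and the reset would clear the decoder and attention states; both of those steps are outside the scope of this sketch.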