Electrical Engineering and Systems Science > Audio and Speech Processing
Title: Word Order Does Not Matter For Speech Recognition
(Submitted on 12 Oct 2021 (v1), last revised 18 Oct 2021 (this version, v2))
Abstract: In this paper, we study training an automatic speech recognition system in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model that aggregates the distribution of all output frames using a LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated by this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification loss. Our system achieves a word error rate of 2.3%/4.6% on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.
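The word-level objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the details here are assumptions: frame scores are pooled over time with LogSumExp, the pooled scores are normalized with a softmax over the vocabulary, and the target is the normalized word count of the transcript. The function names `logsumexp` and `bag_of_words_loss` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bag_of_words_loss(frame_logits, transcript_words, vocab):
    """Cross-entropy between pooled frame scores and an unordered word target.

    frame_logits: list of T frames, each a list of len(vocab) scores.
    transcript_words: transcript words in any order.
    vocab: list of words, aligned with the score columns.
    """
    # Pool over time with LogSumExp: one score per vocabulary word.
    pooled = [
        logsumexp([frame[v] for frame in frame_logits])
        for v in range(len(vocab))
    ]
    # Assumption: normalize pooled scores into log-probabilities over the vocabulary.
    log_z = logsumexp(pooled)
    log_probs = [p - log_z for p in pooled]
    # Assumption: the target is the normalized word count, which ignores word order.
    counts = Counter(transcript_words)
    total = sum(counts.values())
    return -sum(
        (counts[w] / total) * log_probs[i]
        for i, w in enumerate(vocab)
        if counts[w]
    )


vocab = ["cat", "sat", "mat"]
frame_logits = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3], [-0.5, 0.2, 1.8]]
loss_a = bag_of_words_loss(frame_logits, ["cat", "sat", "mat"], vocab)
loss_b = bag_of_words_loss(frame_logits, ["mat", "cat", "sat"], vocab)
```

Because the target depends only on word counts, `loss_a` and `loss_b` are identical: shuffling the transcript does not change the loss, which is the property that lets training proceed without word order.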
Submission history
From: Vineel Pratap
[v1] Tue, 12 Oct 2021 13:35:01 GMT (2435kb,D)
[v2] Mon, 18 Oct 2021 19:04:13 GMT (2415kb,D)