A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding

Wang, Yingzhi; Boumadane, Abdelmoumene; Heba, Abdelwahab

(Submitted on 4 Nov 2021 (v1), last revised 3 Oct 2022 (this version, v3))

Abstract: Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, they have not been totally proven to produce better performance on tasks other than ASR. In this work, we explored partial fine-tuning and entire fine-tuning on wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding. With simple proposed downstream frameworks, the best scores reached 79.58% weighted accuracy on speaker-dependent setting and 73.01% weighted accuracy on speaker-independent setting for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, 89.38% accuracy for Intent Classification and 78.92% F1 for Slot Filling on SLURP, showing the strength of fine-tuned wav2vec 2.0 and HuBERT on learning prosodic, voice-print and semantic representations.

Comments:	7 pages, 2 figures
Subjects:	Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2111.02735 [cs.CL]
	(or arXiv:2111.02735v3 [cs.CL] for this version)

Submission history

From: Yingzhi Wang [view email]
[v1] Thu, 4 Nov 2021 10:39:06 GMT (185kb)
[v2] Tue, 19 Apr 2022 12:59:44 GMT (384kb,D)
[v3] Mon, 3 Oct 2022 20:50:54 GMT (375kb,D)
