An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Peng, Junyi; Plchot, Oldrich; Stafylakis, Themos; Mosner, Ladislav; Burget, Lukas; Cernocky, Jan

(Submitted on 3 Oct 2022)

Abstract: In recent years, the self-supervised learning paradigm has received extensive attention due to its success in various downstream tasks. However, fine-tuning strategies for adapting these pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and a learning rate schedule that stabilize the fine-tuning process and further boost performance. Multi-head factorized attentive pooling is proposed to factorize the comparison of speaker representations into multiple phonetic clusters. We regularize towards the parameters of the pre-trained model, and we set a different learning rate for each layer of the pre-trained model during fine-tuning. The experimental results show that our method can significantly shorten the training time to 4 hours and achieve state-of-the-art (SOTA) performance: 0.59%, 0.79% and 1.77% equal error rate (EER) on Vox1-O, Vox1-E and Vox1-H, respectively.
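
The two fine-tuning ingredients named in the abstract, per-layer learning rates and regularization towards the pre-trained parameters, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the per-layer rates are chosen or which penalty is used, so the geometric decay across layers, the squared L2 distance, and the names `layerwise_learning_rates`, `l2_to_pretrained`, `decay` and `weight` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def layerwise_learning_rates(base_lr: float, decay: float, num_layers: int) -> List[float]:
    """Return one learning rate per layer (index 0 is the lowest layer).

    Assumption: the top layer uses base_lr and every layer below it is
    scaled by a further factor of `decay`. The paper may use another rule.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def l2_to_pretrained(
    params: Dict[str, Sequence[float]],
    pretrained: Dict[str, Sequence[float]],
    weight: float,
) -> float:
    """Penalty that grows as parameters drift from their pre-trained values.

    Assumption: weight * sum of squared differences, added to the task loss.
    """
    total = 0.0
    for name, values in params.items():
        reference = pretrained[name]
        total += sum((w - w0) ** 2 for w, w0 in zip(values, reference))
    return weight * total


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the call pattern.
    print(layerwise_learning_rates(base_lr=1e-4, decay=0.9, num_layers=4))
    print(l2_to_pretrained({"w": [0.2, 0.4]}, {"w": [0.1, 0.4]}, weight=0.01))
```

Under these assumptions, lower layers of the pre-trained model move more slowly than upper layers, and the penalty pulls all layers back towards their starting point, which is one plausible way to obtain the stabilizing effect the abstract describes.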

Comments:	Accepted by SLT2022
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2210.01273 [eess.AS]
	(or arXiv:2210.01273v1 [eess.AS] for this version)

Submission history

From: Junyi Peng
[v1] Mon, 3 Oct 2022 23:46:11 GMT (4020kb,D)
