RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification

Jung, Jee-weon; Heo, Hee-Soo; Kim, Ju-ho; Shim, Hye-jin; Yu, Ha-Jin

(Submitted on 17 Apr 2019 (v1), last revised 17 Jul 2019 (this version, v2))

Abstract: Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring further investigation. In this study, we explore end-to-end deep neural networks that input raw waveforms to improve various aspects: front-end speaker embedding extraction including model architecture, pre-training scheme, additional objective functions, and back-end classification. Adjustment of model architecture using a pre-training scheme can extract speaker embeddings, giving a significant improvement in performance. Additional objective functions simplify the process of extracting speaker embeddings by merging conventional two-phase processes: extracting utterance-level features such as i-vectors or x-vectors and the feature enhancement phase, e.g., linear discriminant analysis. Effective back-end classification models that suit the proposed speaker embedding are also explored. We propose an end-to-end system that comprises two deep neural networks, one front-end for utterance-level speaker embedding extraction and the other for back-end classification. Experiments conducted on the VoxCeleb1 dataset demonstrate that the proposed model achieves state-of-the-art performance among systems without data augmentation. The proposed system is also comparable to the state-of-the-art x-vector system that adopts data augmentation.

Comments:	Accepted for oral presentation at Interspeech 2019, code available at this http URL
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:1904.08104 [eess.AS]
	(or arXiv:1904.08104v2 [eess.AS] for this version)

Submission history

From: Jee-Weon Jung
[v1] Wed, 17 Apr 2019 06:37:22 GMT (29kb)
[v2] Wed, 17 Jul 2019 03:52:16 GMT (29kb)
