Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal

Peng, Zhiyuan; Feng, Siyuan; Lee, Tan



(Submitted on 30 Oct 2019)

Abstract: A speech signal is shaped by several informative factors, such as linguistic content and speaker characteristics. Notable recent studies have attempted to factorize the speech signal into these individual factors without requiring any annotation. These studies typically assume a continuous representation for linguistic content, which is not in accordance with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models the linguistic content of input speech with a discrete categorical distribution. It performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level vector representation. A frame decoder serves to reconstruct speech features from the encoders' outputs. The mFAE is evaluated on a speaker verification (SV) task and an unsupervised subword modeling (USM) task. The SV experiments on VoxCeleb 1 show that the utterance embedder is capable of extracting speaker-discriminative embeddings with performance comparable to an x-vector baseline. The USM experiments on the ZeroSpeech 2017 dataset verify that the frame tokenizer is able to capture linguistic content and the utterance embedder can acquire speaker-related information.
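
The data flow described in the abstract (a frame tokenizer that produces soft mixture labels, an utterance embedder that produces one vector per utterance, and a frame decoder that reconstructs the features from both) can be illustrated with a toy forward pass. This is only a rough sketch under strong assumptions and is not the authors' model: the class name `ToyMFAE`, the single linear layers, the mean pooling, the codebook-plus-offset decoder, the mean squared error loss, and all dimensions are hypothetical choices made for illustration, and no training step is shown.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMFAE:
    """Hypothetical toy layout, not the architecture from the paper."""

    def __init__(self, feat_dim, num_mixtures, embed_dim, seed=0):
        rng = random.Random(seed)
        self.w_tok = random_matrix(num_mixtures, feat_dim, rng)     # tokenizer
        self.w_emb = random_matrix(embed_dim, feat_dim, rng)        # embedder
        self.codebook = random_matrix(num_mixtures, feat_dim, rng)  # decoder
        self.w_dec = random_matrix(feat_dim, embed_dim, rng)        # decoder

    def tokenize(self, frames):
        # One categorical distribution over mixtures per frame (soft label).
        return [softmax(matvec(self.w_tok, frame)) for frame in frames]

    def embed(self, frames):
        # Assumption: average the frames over time, then project once.
        mean = [sum(column) / len(frames) for column in zip(*frames)]
        return matvec(self.w_emb, mean)

    def decode(self, posteriors, utt_vec):
        # Assumption: posterior-weighted codebook entry plus an
        # utterance-dependent offset shared by all frames.
        offset = matvec(self.w_dec, utt_vec)
        recon = []
        for q in posteriors:
            content = [
                sum(qk * row[d] for qk, row in zip(q, self.codebook))
                for d in range(len(offset))
            ]
            recon.append([c + o for c, o in zip(content, offset)])
        return recon

    def forward(self, frames):
        posteriors = self.tokenize(frames)
        utt_vec = self.embed(frames)
        recon = self.decode(posteriors, utt_vec)
        # Mean squared reconstruction error over all frames and dimensions.
        errors = [
            (x - y) ** 2
            for frame, rec in zip(frames, recon)
            for x, y in zip(frame, rec)
        ]
        return posteriors, utt_vec, recon, sum(errors) / len(errors)


# Usage: one fake utterance of 6 frames with 4 features each.
data_rng = random.Random(1)
utterance = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
model = ToyMFAE(feat_dim=4, num_mixtures=3, embed_dim=2)
posteriors, utt_vec, recon, loss = model.forward(utterance)
```

In this sketch the frame-level output (`posteriors`) varies from frame to frame while the utterance-level output (`utt_vec`) is constant across the utterance, which mirrors the split between linguistic content and speaker information that the abstract describes.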

Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Cite as:	arXiv:1911.01806 [eess.AS]
	(or arXiv:1911.01806v1 [eess.AS] for this version)

Submission history

From: Zhiyuan Peng
[v1] Wed, 30 Oct 2019 08:54:34 GMT (616kb,D)
